CN111859894B - Method and device for determining scenario text - Google Patents

Publication number
CN111859894B
Authority
CN
China
Prior art keywords
text
aggregation
objects
adjacent
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010724600.8A
Other languages
Chinese (zh)
Other versions
CN111859894A (en
Inventor
郏昕
阳任科
赵冲翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010724600.8A
Publication of CN111859894A
Application granted
Publication of CN111859894B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The embodiment of the invention provides a method and a device for determining scenario text, and relates to the technical field of data processing. The method comprises: determining each text unit in a text; extracting the content features of each text unit; performing feature matching on the content features of text units that are adjacent in the text, assigning adjacent text units with similar content features to the same unit cluster, and thereby determining the unit clusters corresponding to the text; and determining the characters in the text units included in each unit cluster as the scenario text describing the respective scenario in the text. Determining scenario text with the scheme provided by the embodiment of the invention can improve the efficiency of determining scenario text.

Description

Method and device for determining scenario text
Technical Field
The invention relates to the technical field of data processing, in particular to a scenario text determining method and device.
Background
Texts such as novels and scripts describe various scenarios (plot episodes). Each scenario is described by a continuous run of characters, which can be referred to as scenario text. For a novel or a script, whether the plot is paced appropriately and arranged reasonably directly affects how attractive the work is to users, so the scenarios described by texts such as novels and scripts need to be examined.
Before the scenarios described in a text can be examined, the scenario text corresponding to each scenario in the text must be determined. In the prior art, scenario text is generally identified manually by a worker, so the efficiency of determining scenario text is low.
Disclosure of Invention
The embodiment of the invention aims to provide a scenario text determining method and device, so as to improve the efficiency of determining scenario text in a text. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a scenario text determining method, where the method includes:
determining each text unit in the text, wherein each text unit consists of a contiguous run of characters in the text, and the text units do not overlap in position within the text;
extracting content features of each text unit, wherein the content features are features reflecting the content described by the text unit;
performing feature matching on the content features of text units that are adjacent in the text, determining text units with similar content features, assigning the text units with similar content features to the same unit cluster, and thereby determining each unit cluster corresponding to the text;
determining the characters in the text units included in each unit cluster as the scenario text describing the respective scenario in the text.
In one embodiment of the present invention, performing feature matching on the content features of text units that are adjacent in the text, determining text units with similar content features, assigning the text units with similar content features to the same unit cluster, and determining each unit cluster corresponding to the text includes:
performing feature matching on the object features of each pair of adjacent aggregation objects, wherein an aggregation object is the operating unit on which object aggregation is performed, each initial aggregation object contains one text unit, the text units correspond one-to-one to the initial aggregation objects, the initial value of the object features of each aggregation object is the content features of the text unit it contains, and the text units contained in adjacent aggregation objects are adjacent in position in the text;
performing object aggregation on each pair of adjacent aggregation objects according to the matching result to obtain new aggregation objects;
obtaining the object features of each new aggregation object from the object features of the aggregation objects that were aggregated into it;
when a preset aggregation termination condition is met, taking each new aggregation object as a unit cluster corresponding to the text.
In one embodiment of the present invention, the preset aggregation termination condition includes at least one of the following:
the number of aggregation rounds reaches a preset number;
the new aggregation objects are the same as the aggregation objects before aggregation;
the number of first target aggregation objects is greater than a preset object count, where a first target aggregation object is an aggregation object containing more than a preset number of characters.
In one embodiment of the present invention, performing feature matching on the object features of each pair of adjacent aggregation objects includes:
for each type of object feature of the aggregation objects, calculating the similarity of each pair of adjacent aggregation objects on that object feature as a local similarity;
obtaining the overall similarity of each pair of adjacent aggregation objects over all object features from the local similarities of that pair.
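As an illustration only, the local and overall similarity computation described above could be sketched in Python as follows. The choice of cosine similarity as the per-type measure, the use of the arithmetic mean to combine local similarities, the dictionary layout of the features, and the names `cosine`, `local_similarities`, and `overall_similarity` are all assumptions made for this sketch; the embodiment does not prescribe them.

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def local_similarities(obj_a: dict, obj_b: dict) -> dict:
    """One similarity value per feature type (e.g. 'place', 'weather')."""
    return {ftype: cosine(vec, obj_b.get(ftype, {})) for ftype, vec in obj_a.items()}

def overall_similarity(obj_a: dict, obj_b: dict) -> float:
    """Combine the local similarities; the mean is an assumed statistic."""
    local = list(local_similarities(obj_a, obj_b).values())
    return sum(local) / len(local) if local else 0.0

# Two hypothetical adjacent aggregation objects, each with two feature types.
front = {"place": {"bedroom": 1}, "weather": {"rain": 1}}
rear = {"place": {"bedroom": 1}, "weather": {"snow": 1}}
print(local_similarities(front, rear))
print(overall_similarity(front, rear))
```

Keeping the per-type values separate before combining them makes it possible to swap in a different statistic, or per-type weights, without touching the similarity measure itself.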
In one embodiment of the present invention, obtaining the overall similarity of each pair of adjacent aggregation objects over all object features from the local similarities of that pair includes:
performing a statistical calculation on the local similarities of each pair of adjacent aggregation objects to obtain an initial value of the overall similarity of that pair;
adjusting the initial value of the overall similarity of each pair of adjacent aggregation objects according to the following expression to obtain the overall similarity of that pair:
where W is the overall similarity, W_0 is the initial value of the overall similarity, and a, b, c, and d are preset parameters; when calculating the similarity of the front aggregation object relative to the rear aggregation object of an adjacent pair, set_size is the number of text units contained in the rear aggregation object, and when calculating the similarity of the rear aggregation object relative to the front aggregation object, set_size is the number of text units contained in the front aggregation object.
In one embodiment of the present invention, performing object aggregation on each pair of adjacent aggregation objects according to the matching result to obtain new aggregation objects includes:
for each second target aggregation object, selecting the adjacent aggregation object that has the highest overall similarity with it, and aggregating the second target aggregation object with the selected adjacent aggregation object to obtain a new aggregation object, where a second target aggregation object is an aggregation object containing fewer than a preset number of text units.
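The iterative aggregation described in the embodiments above could look roughly like the following Python sketch. It is not the patented implementation: the symmetric cosine similarity, the summing of count vectors to obtain the features of a merged object, the left-to-right scan within one round, and the names `merge_round` and `aggregate` are assumptions. Only two of the listed termination conditions are modelled (a round limit, and no change between rounds).

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two count vectors (assumed similarity measure)."""
    dot = sum(v * b[k] for k, v in a.items())  # Counter yields 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def merge_round(objects, min_units):
    """One round: each object with fewer than min_units text units is merged
    with whichever adjacent object (left or right) is more similar to it."""
    out, i = [], 0
    while i < len(objects):
        units, feat = objects[i]
        left = cosine(out[-1][1], feat) if out else None
        right = cosine(feat, objects[i + 1][1]) if i + 1 < len(objects) else None
        if len(units) >= min_units or (left is None and right is None):
            out.append(objects[i])  # large enough, or no neighbour exists
            i += 1
        elif right is None or (left is not None and left >= right):
            prev_units, prev_feat = out.pop()  # merge into the left neighbour
            out.append((prev_units + units, prev_feat + feat))
            i += 1
        else:
            next_units, next_feat = objects[i + 1]  # merge with the right neighbour
            out.append((units + next_units, feat + next_feat))
            i += 2
    return out

def aggregate(unit_features, min_units=2, max_rounds=10):
    """Start with one aggregation object per text unit and merge until the
    round limit is reached or a round changes nothing."""
    objects = [([i], Counter(f)) for i, f in enumerate(unit_features)]
    for _ in range(max_rounds):
        merged = merge_round(objects, min_units)
        if len(merged) == len(objects):
            break
        objects = merged
    return [units for units, _ in objects]

# Hypothetical keyword counts for four consecutive text units.
features = [{"rain": 2}, {"rain": 1, "umbrella": 1}, {"office": 2}, {"office": 1, "desk": 1}]
print(aggregate(features))
```

Because objects are only ever merged with a neighbour, every resulting cluster is a contiguous run of text units, which matches the requirement that text units in adjacent aggregation objects be adjacent in the text.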
In one embodiment of the present invention, performing feature matching on the object features of each pair of adjacent aggregation objects includes:
determining, among the aggregation objects, the pairs of adjacent aggregation objects whose text units do not contain a target character, where a target character is a character indicating that a pair of adjacent aggregation objects cannot be aggregated;
performing feature matching on the object features of each determined pair of adjacent aggregation objects.
In one embodiment of the present invention, the content features include character-specific features, and when the content described by the text units includes text characters, extracting the content features of each text unit includes:
identifying the names of the text characters in each text unit;
for each text unit, extracting character-specific features from the remaining characters, where the remaining characters are the characters in the text unit other than the names of text characters.
In a second aspect, an embodiment of the present invention provides a scenario text determining apparatus, where the apparatus includes:
a unit determining module, configured to determine each text unit in the text, wherein each text unit consists of a contiguous run of characters in the text, and the text units do not overlap in position within the text;
a feature extraction module, configured to extract the content features of each text unit, wherein the content features are features reflecting the content described by the text unit;
a unit cluster determining module, configured to perform feature matching on the content features of text units that are adjacent in the text, determine text units with similar content features, assign the text units with similar content features to the same unit cluster, and thereby determine each unit cluster corresponding to the text;
a scenario text determining module, configured to determine the characters in the text units included in each unit cluster as the scenario text describing the respective scenario in the text.
In one embodiment of the present invention, the unit cluster determining module includes:
a feature matching sub-module, configured to perform feature matching on the object features of each pair of adjacent aggregation objects, wherein an aggregation object is the operating unit on which object aggregation is performed, each initial aggregation object contains one text unit, the text units correspond one-to-one to the initial aggregation objects, the initial value of the object features of each aggregation object is the content features of the text unit it contains, and the text units contained in adjacent aggregation objects are adjacent in position in the text;
an object aggregation sub-module, configured to perform object aggregation on each pair of adjacent aggregation objects according to the matching result to obtain new aggregation objects;
a feature obtaining sub-module, configured to obtain the object features of each new aggregation object from the object features of the aggregation objects that were aggregated into it;
a unit cluster determining sub-module, configured to take each aggregation object as a unit cluster corresponding to the text when a preset aggregation termination condition is met.
In one embodiment of the present invention, the preset aggregation termination condition includes at least one of the following:
the number of aggregation rounds reaches a preset number;
the new aggregation objects are the same as the aggregation objects before aggregation;
the number of first target aggregation objects is greater than a preset object count, where a first target aggregation object is an aggregation object containing more than a preset number of characters.
In one embodiment of the present invention, the feature matching sub-module includes:
a similarity calculation unit, configured to calculate, for each type of object feature of the aggregate object, a similarity of each pair of adjacent aggregate objects on the object feature as a local similarity;
and the similarity obtaining unit is used for obtaining the overall similarity of each pair of adjacent aggregation objects on all object characteristics according to the local similarity of each pair of adjacent aggregation objects.
In one embodiment of the present invention, the similarity obtaining unit is specifically configured to:
perform a statistical calculation on the local similarities of each pair of adjacent aggregation objects to obtain an initial value of the overall similarity of that pair;
adjust the initial value of the overall similarity of each pair of adjacent aggregation objects according to the following expression to obtain the overall similarity of that pair:
where W is the overall similarity, W_0 is the initial value of the overall similarity, and a, b, c, and d are preset parameters; when calculating the similarity of the front aggregation object relative to the rear aggregation object of an adjacent pair, set_size is the number of text units contained in the rear aggregation object, and when calculating the similarity of the rear aggregation object relative to the front aggregation object, set_size is the number of text units contained in the front aggregation object.
In one embodiment of the present invention, the object aggregation sub-module is specifically configured to:
for each second target aggregation object, select the adjacent aggregation object that has the highest overall similarity with it, and aggregate the second target aggregation object with the selected adjacent aggregation object to obtain a new aggregation object, where a second target aggregation object is an aggregation object containing fewer than a preset number of text units.
In one embodiment of the present invention, the feature matching sub-module is specifically configured to:
determine, among the aggregation objects, the pairs of adjacent aggregation objects whose text units do not contain a target character, where a target character is a character indicating that a pair of adjacent aggregation objects cannot be aggregated;
perform feature matching on the object features of each determined pair of adjacent aggregation objects.
In one embodiment of the present invention, the content features include character-specific features, and when the content described by the text units includes text characters, the feature extraction module is specifically configured to:
identify the names of the text characters in each text unit;
for each text unit, extract character-specific features from the remaining characters, where the remaining characters are the characters in the text unit other than the names of text characters.
In a third aspect, the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of the first aspects when executing a program stored on a memory.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored therein, which when executed by a processor, implements the method steps of any of the first aspects.
In a fifth aspect, embodiments of the present invention also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method steps of any of the first aspects described above.
The embodiment of the invention has the beneficial effects that:
When the scheme provided by the embodiment of the invention is applied to determine scenario text, feature matching is performed on the content features of the text units. Text units within the same scenario text are similar, that is, their degree of matching is high, whereas text units of different scenario texts are dissimilar, that is, their degree of matching is low. Adjacent and similar text units are therefore assigned to the same unit cluster, so the characters in the text units of each unit cluster form the scenario text describing one scenario in the text, and each scenario text in the text can be determined by the scheme provided by the embodiment of the invention. Compared with the prior art, no manual participation is needed, so the efficiency of determining scenario text in a text can be improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a first scenario text determining method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a second scenario text determination method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a third scenario text determination method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a fourth scenario text determination method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a fifth scenario text determination method according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of a scenario text determination method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a first scenario text determining apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a second scenario text determining apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a third scenario text determining apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Because determining scenario text in the prior art suffers from low efficiency, the embodiment of the invention provides a scenario text determining method and device to solve this problem.
In one embodiment of the present invention, there is provided a scenario text determining method, the method including:
determining each text unit in the text, wherein each text unit consists of a contiguous run of characters in the text, and the text units do not overlap in position within the text;
extracting the content features of each text unit, wherein the content features are features reflecting the content described by the text unit;
performing feature matching on the content features of text units that are adjacent in the text, determining text units with similar content features, assigning the text units with similar content features to the same unit cluster, and thereby determining each unit cluster corresponding to the text;
determining the characters in the text units included in each unit cluster as the scenario text describing the respective scenario in the text.
As can be seen from the above, when scenario text is determined by the scheme provided by the embodiment of the invention, feature matching is performed on the content features of the text units. Text units within the same scenario text are similar, that is, their degree of matching is high, whereas text units of different scenario texts are dissimilar, that is, their degree of matching is low. Adjacent and similar text units are therefore assigned to the same unit cluster, so the characters in the text units of each unit cluster form the scenario text describing one scenario in the text, and each scenario text in the text can be determined by this scheme. Compared with the prior art, no manual participation is needed, so the efficiency of determining scenario text in a text can be improved.
The following describes a scenario text determining method and device provided by the embodiment of the invention through a specific embodiment.
Referring to FIG. 1, an embodiment of the present invention provides a flowchart of a first scenario text determining method. Specifically, the method includes the following steps S101-S104.
S101: each text unit in the text is determined.
Each text unit consists of a contiguous run of characters in the text, and the text units do not overlap in position within the text.
Specifically, a text unit may be a natural paragraph, a chapter, a scene, and the like.
In one embodiment of the present invention, the text unit may depend on the type of the text, so that different types of text use different text units. For example, the type of text may be a letter type, a novel type, a script type, or the like.
When the text is of the letter type, the text unit may be a natural paragraph in the text; when the text is of the novel type, the text unit may be a chapter in the text; when the text is of the script type, the text unit may be a scene in the text.
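A minimal Python sketch of step S101 under stated assumptions is given below. The function name `split_into_units`, the type labels, and in particular the delimiters (blank lines between natural paragraphs, and lines beginning with "Chapter N" or "Scene N") are hypothetical; the embodiment does not say how unit boundaries are detected.

```python
import re

def split_into_units(text: str, text_type: str) -> list[str]:
    """Split a text into contiguous, non-overlapping text units.

    The delimiters below are illustrative assumptions only.
    """
    if text_type == "letter":
        # Natural paragraphs, assumed to be separated by blank lines.
        parts = re.split(r"\n\s*\n", text)
    elif text_type == "novel":
        # Chapters, assumed to start with a line such as "Chapter 3".
        parts = re.split(r"^(?=Chapter \d+)", text, flags=re.MULTILINE)
    elif text_type == "script":
        # Scenes, assumed to start with a line such as "Scene 12".
        parts = re.split(r"^(?=Scene \d+)", text, flags=re.MULTILINE)
    else:
        raise ValueError(f"unknown text type: {text_type}")
    # Drop empty pieces (e.g. the one before the first heading).
    return [p.strip() for p in parts if p.strip()]

print(split_into_units("Scene 1\nA street.\nScene 2\nA bedroom.\n", "script"))
```

Splitting on a zero-width lookahead keeps each heading with the unit it introduces, so the units remain contiguous slices of the original text.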
S102: the content features of each text unit are extracted.
The content features are features reflecting the content described by the text unit.
A text unit includes a large number of characters, and the content expressed by these characters in combination can be called text content. Text content reflects different information, and the reflected information can serve as the content features of the text unit, which distinguish different text units.
In one embodiment of the present invention, keyword extraction, semantic analysis, and the like may be performed on the characters in each text unit, and the features of the text content of each text unit, referred to as content features, are extracted from the processing result.
Specifically, the content features may include at least one of the following features: features for time, features for places, features for text characters, features for weather, features for items, features for characters, etc.
The content features described above may be represented in the form of vectors or data sets.
The time-specific features may be features for time intervals, e.g., three years ago or three years later, or features for specific times, e.g., day, night, morning, or afternoon.
The location-specific features may be features for specific places, such as a living room, bedroom, playground, or forest.
The text characters may be people, animals, and the like in the text.
The features for text characters may be: the text characters contained in the text unit, the role weight of each text character, and the like. Specifically, when the text is a script, the role weight of a text character may be determined from the number of appearances of that text character, where the number of appearances may include the number of times the text character speaks dialogue and the number of times the text character performs an action. The role weights of the text characters can be distinguished by their numbers of appearances: for example, the text character with the most appearances is determined to be the lead role in the text, other text characters whose number of appearances exceeds a preset number are determined to be supporting roles, and the remaining text characters are determined to be other roles.
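The classification by number of appearances can be illustrated with a short Python sketch. The function name `classify_roles`, the flat list of appearance events as input, and the labels "lead", "supporting", and "other" are assumptions for illustration; when several text characters tie for the most appearances, this sketch simply picks the one encountered first.

```python
from collections import Counter

def classify_roles(appearances: list[str], supporting_min: int) -> dict[str, str]:
    """Label each text character from its number of appearances.

    `appearances` holds one entry per dialogue line or action (assumed input
    format); `supporting_min` is the preset number from the description.
    """
    counts = Counter(appearances)
    if not counts:
        return {}
    lead, _ = counts.most_common(1)[0]  # most appearances -> lead role
    roles = {}
    for name, n in counts.items():
        if name == lead:
            roles[name] = "lead"
        elif n > supporting_min:
            roles[name] = "supporting"
        else:
            roles[name] = "other"
    return roles

# Hypothetical appearance log for one script.
log = ["Ann"] * 5 + ["Bob"] * 3 + ["Cat"]
print(classify_roles(log, supporting_min=2))
```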
The weather-specific features described above may be: features specific to specific weather, such as foggy weather, rainy weather, snowy weather, etc.
The above features for the article may be: characteristics specific to specific articles, such as umbrellas, cups, bicycles, and the like.
The character-specific features may be the number of occurrences and the degree of importance of the characters in the text unit.
The number of occurrences may be the number of times each word formed from the characters appears in the text unit, such as 5 times or 10 times.
The degree of importance may be the importance of each word formed from the characters in the text unit.
Specifically, the degree of importance of each character string formed from the characters in the text can be calculated with the TF-IDF model.
The calculation formula of the TF-IDF model is as follows:
$$\mathrm{tfidf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $\mathrm{tfidf}_{i,j}$ is the feature of the i-th character string in the j-th text unit, $n_{i,j}$ is the number of occurrences of the i-th character string in the j-th text unit, $n_{k,j}$ is the number of occurrences of the k-th character string in the j-th text unit, and k ranges over all character strings in the j-th text unit, so that $\sum_k n_{k,j}$ is the total number of occurrences of all character strings in the j-th text unit; $|D|$ is the number of text units, $t_i \in d_j$ indicates that the j-th text unit contains the i-th character string, and $|\{j : t_i \in d_j\}|$ is the number of text units containing the i-th character string.
Specifically, for the i-th character string, the fewer text units contain the character string, the rarer the character string is across the text units. Text units can be distinguished by rare character strings, so the importance of a rare character string is usually high. Conversely, the more text units contain the character string, the more common it is across the text units, and it is difficult to distinguish text units by commonly occurring character strings, so the importance of a commonly occurring character string, such as "today" or "eat", is usually low.
As can be seen from the formula, the fewer text units contain the i-th character string, the larger the value of $\log\frac{|D|}{|\{j : t_i \in d_j\}|}$, so this term can be used to represent the degree of importance of the character string. At the same time, $\frac{n_{i,j}}{\sum_k n_{k,j}}$ is the ratio of the number of occurrences of the i-th character string in the j-th text unit to the total number of occurrences of all character strings in that text unit, and can be used to represent the number of occurrences of the i-th character string in the j-th text unit. The calculated $\mathrm{tfidf}_{i,j}$ therefore reflects both the number of occurrences and the degree of importance of the characters, and can be used to represent the character-specific features.
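The following Python sketch computes such a feature for every character string in every text unit. It assumes the standard TF-IDF form (term frequency within the text unit multiplied by the natural logarithm of the inverse unit frequency) and assumes the text units have already been segmented into character strings; the function name `tfidf` and the list-of-lists input are hypothetical.

```python
import math
from collections import Counter

def tfidf(units: list[list[str]]) -> list[dict[str, float]]:
    """Return, for each text unit, a mapping from character string to tf-idf.

    units[j] is the list of character strings (tokens) of the j-th text unit.
    """
    n_units = len(units)
    # Number of text units that contain each string at least once.
    unit_freq = Counter()
    for tokens in units:
        unit_freq.update(set(tokens))

    features = []
    for tokens in units:
        counts = Counter(tokens)        # n_{i,j}
        total = sum(counts.values())    # sum over k of n_{k,j}
        features.append({
            s: (n / total) * math.log(n_units / unit_freq[s])
            for s, n in counts.items()
        })
    return features

# "today" occurs in both units, so its weight is zero; "rain" is distinctive.
scores = tfidf([["rain", "today"], ["sun", "today"]])
print(scores[0])
```

A string that appears in every text unit gets a weight of exactly zero, which reflects the observation that commonly occurring strings do not help distinguish text units.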
The calculation of the character-specific features is described in steps A-B of the following embodiments and is not detailed here.
S103: and performing feature matching on the content features of the text units adjacent to each other in the text, determining the text units with similar content features into the same unit cluster, and determining each unit cluster corresponding to the text.
Wherein each cell cluster includes at least one text cell therein.
In one embodiment of the invention, feature matching of the content features can be realized by calculating the similarity of the content features of the extracted text units, and then sequentially adjacent and similar text units in the text are determined according to the calculated similarity. For example, a text unit having a similarity higher than a preset similarity is determined as a similar text unit.
Specifically, feature matching can be performed on the content features of every two adjacent text units to obtain a matching result between every two adjacent text units, and then adjacent text units with similar content features represented by the matching result are determined as text units in the same unit cluster.
For example, the text includes a text unit 1, a text unit 2, a text unit 3, and a text unit 4 that are adjacent in sequence, and the text unit 1, the text unit 2, the text unit 3, and the text unit 4 are respectively subjected to feature matching, and if the content features of the text unit 1 and the text unit 2 are similar, the content features of the text unit 2 and the text unit 3 are similar, the content features of the text unit 3 and the text unit 4 are dissimilar, the text unit 1, the text unit 2, and the text unit 3 are determined to be the same unit cluster, and the text unit 4 is another unit cluster.
The unit clusters corresponding to the text may be determined through steps S103A to S103D described below, which are not detailed here.
S104: characters in text units included in each unit cluster are respectively determined as scenario text for describing each scenario in the texts.
As can be seen from the above, when determining a scenario text, the embodiment of the present invention performs feature matching according to the content features of text units, since text units in the same scenario text are similar, that is, the matching degree between text units is higher, and text units in different scenario texts are dissimilar, that is, the matching degree between text units is lower, therefore, adjacent and similar text units are determined as the same unit cluster, and thus, characters in text units included in each unit cluster are scenario texts in texts for describing each scenario respectively, and therefore, each scenario text in the text can be determined through the scheme provided by the embodiment of the present invention. Compared with the prior art, when the scheme provided by the embodiment of the invention is applied to determine the scenario text, manual participation is not needed, so that the efficiency of determining the scenario text in the text can be improved.
In one embodiment of the present invention, in the case where the text content described by the text units includes text characters (that is, roles), the character-specific features of the respective text units may be extracted through the following steps A to B.
Step A: character names of text characters in respective text units are identified.
In particular, the character names of text characters in a text unit may be identified with reference to character strings in a known corpus that are labeled as character names. For example, the known corpus may be an encyclopedia corpus, a news corpus, or the like.
Step B: For each text unit, extract features for the remaining characters, the remaining characters being the characters in the text unit other than the character names.
When the text is a novel, a script or the like, the character names of the text characters are likely to be uncommon names. If word segmentation were performed directly on each text unit, the resulting character strings would be affected by these uncommon names, and some strings obtained by segmentation would contain fragments of the names. Extracting the character-specific features after the character names have been removed therefore improves the accuracy of the obtained features.
In one embodiment of the present invention, in the case where a text character is included in the text content of the text unit description, features for the remaining characters may be extracted through steps B1 to B8.
Step B1: a first character string capable of forming a vocabulary in characters other than the character name is extracted.
Specifically, word segmentation processing may be performed on characters except for character names in the text unit, so as to obtain each character string capable of forming a vocabulary, which is called a first character string.
For example, a first character string among characters other than the character name may be extracted using a mechanical word segmentation method.
Step B2: Determine, among the first character strings, second character strings whose number of occurrences is greater than the lowest preset number of occurrences and smaller than the highest preset number of occurrences.
Here the number of occurrences is the number of times the character string occurs in the text unit, for example 50 times or 60 times. The lowest preset number of occurrences may be 10, 15, etc., and the highest preset number of occurrences may be 60, 70, etc.
Character strings that occur too frequently are likely to be common words such as "yes" and "good", which carry little information and are therefore of low importance in the text; character strings that occur too rarely likewise carry little information and are of low importance. Keeping only the character strings whose number of occurrences lies between the lowest and highest preset numbers removes both kinds, so that features are extracted only for strings of higher importance, which improves the efficiency of determining the character-specific features.
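The occurrence-count filter of step B2 can be sketched as below. The function name `filter_by_frequency` and the sample strings are hypothetical; the bounds 10 and 60 are taken from the example values above.

```python
from collections import Counter
from typing import Iterable, List

def filter_by_frequency(strings: Iterable[str],
                        lowest: int, highest: int) -> List[str]:
    """Keep distinct strings whose count satisfies lowest < count < highest."""
    counts = Counter(strings)
    return [s for s, n in counts.items() if lowest < n < highest]

# Hypothetical segmented strings of one text unit.
first_strings = ["yes"] * 70 + ["run"] * 20 + ["telescope"] * 2
second_strings = filter_by_frequency(first_strings, lowest=10, highest=60)
print(second_strings)  # ['run']
```

Both bounds are strict, matching the wording "greater than the lowest" and "smaller than the highest".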
Step B3: extracting character strings with parts of speech of nouns and verbs in the second character string as third character strings.
Since the character strings with parts of speech as nouns and verbs are highly likely to be used as subjects, predicates and objects in the sentences of the text, and the amount of contained information is high, only the features of the character strings with parts of speech as nouns and verbs in the text unit are extracted, and the efficiency of extracting the features of the characters can be improved.
Step B4: Remove the character strings belonging to the stop word list from the third character strings to obtain fourth character strings.
Specifically, the stop word list may be a table of character strings preset as having low importance.
For example, the stop word list may include character strings of low importance such as "good" and "yes".
Step B5: and removing punctuation marks in the fourth character string to obtain a fifth character string.
Since punctuation marks do not contain specific information in a text, the importance degree in the text is low, and only the features of characters other than the punctuation marks in a text unit are extracted, so that the efficiency of extracting the features of the characters can be improved.
Step B6: and removing the character strings belonging to the special vocabulary in the fifth character string to obtain a sixth character string.
The special vocabulary may be a table composed of special-term character strings that are unrelated to the specific scenarios described by the text.
For example, when the text is a script, the character string included in the special vocabulary may be a special term character string such as "voice-over", "flashback", or when the text is a novel, the vocabulary included in the special vocabulary may be a special term character string such as "chapter one", "chapter two".
Step B7: According to the character names, determine the character strings in the sixth character strings that do not belong to a character name as seventh character strings. Since word segmentation errors may occur while extracting the first character strings, the first character strings may include strings composed of characters from the character names; the seventh character strings are therefore determined from the sixth character strings according to the character names.
For example, when the character name includes "Wang Jiangjiang", the character strings such as "Wang Jiangjiang", "Wang Jiang" and "Jiang Jiang" in the sixth character string are removed, and a seventh character string is determined.
Step B8: features for each seventh string are extracted.
Specifically, features for the seventh string may be extracted by a tfidf model.
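A tf-idf feature of this kind can be sketched as follows. The text does not reproduce the exact formula here, so the sketch assumes the common form: term frequency as the share of a string among the strings of its unit, multiplied by log(N / df), where N is the number of text units and df the number of units containing the string. The function name `tfidf` and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf(units: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {string: tf-idf} mapping per text unit (assumed formula)."""
    n_units = len(units)
    doc_freq: Counter = Counter()
    for unit in units:
        doc_freq.update(set(unit))      # count each string once per unit
    features = []
    for unit in units:
        counts = Counter(unit)
        total = sum(counts.values())
        features.append({
            s: (c / total) * math.log(n_units / doc_freq[s])
            for s, c in counts.items()
        })
    return features

# Hypothetical seventh character strings of two text units.
units = [["run", "run", "walk"], ["walk", "talk"]]
features = tfidf(units)
print(features[0])
```

Under this assumed formula, a string that appears in every unit ("walk") receives weight 0, while a string concentrated in one unit ("run") receives a higher weight.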
Referring to fig. 2, a flow chart of a second scenario text determination method is provided, and the aforementioned step S103 may be implemented by the following steps S103A-S103D, compared to the aforementioned embodiment shown in fig. 1.
A single round of feature matching may fail to aggregate all text units with similar content features into the same unit cluster, so multiple rounds of feature matching, that is, multiple rounds of text unit aggregation, may be performed. Because many concepts are involved in multi-round aggregation, the following embodiments introduce the concept of an aggregate object to keep the description clear: before the unit clusters are finally determined, each set of text units is called an aggregate object. The text units contained in each aggregate object may change after each round of object aggregation, and once a preset aggregation termination condition is met, the aggregate objects are determined to be the unit clusters.
S103A: Perform feature matching on the object features of each pair of adjacent aggregate objects.
An aggregate object is a set of text units. The initial value of the object feature of each aggregate object is the content feature of the text unit it contains, and the text units contained in adjacent aggregate objects are adjacent in position in the text.
Specifically, each scenario is described by a continuous run of characters in the text, so feature matching is performed on two adjacent aggregate objects, and the two are aggregated into a new aggregate object when their features match. Because the text units contained in adjacent aggregate objects are continuous in the text, the text units contained in the new aggregate object are also continuous. Consequently, when the aggregate objects are finally determined as unit clusters, the text units in each unit cluster, and hence its characters, are continuous in the text, and the scenario text determined for each scenario is a continuous run of characters.
In contrast, if two non-adjacent aggregate objects were matched and aggregated into a new aggregate object, the text units it contains would be discontinuous. If that object were determined as a unit cluster and its characters as the scenario text of a scenario, the scenario would not be described by a continuous run of characters in the text. Feature matching is therefore unnecessary for non-adjacent aggregate objects and is performed only on each pair of adjacent aggregate objects, which reduces the amount of calculation during feature matching.
Since a plurality of text units may be included in the text, it may also be considered that there are a plurality of aggregate objects mentioned in this step, and in the initial state, each aggregate object includes one text unit, that is, each text unit is a text unit that is initially included in each aggregate object.
For example, the text unit 1 and the text unit 2 are adjacent, the aggregate object 1 initially contains the text unit 1, the initial value of the object feature of the aggregate object 1 is the content feature of the text unit 1, the aggregate object 2 initially contains the text unit 2, the initial value of the object feature of the aggregate object 2 is the content feature of the text unit 2, and since the text unit 1 is adjacent to the text unit 2, the aggregate object 1 and the aggregate object 2 are adjacent aggregate objects.
Specifically, feature matching may be performed on object features of each pair of adjacent aggregation objects among the aggregation objects through the following steps S103A1 to S103A2, which will not be described in detail herein.
S103B: Perform object aggregation on each pair of adjacent aggregate objects according to the matching result to obtain new aggregate objects.
Specifically, if the matching result indicates that the object features of a pair of adjacent aggregate objects are similar, the two objects match each other, that is, they are similar objects; object aggregation can therefore be performed on them, and they are determined to belong to the same scenario text.
If the matching result of the adjacent aggregation objects indicates that the object features of the adjacent aggregation objects are dissimilar, the adjacent aggregation objects are not matched, that is, the adjacent aggregation objects are dissimilar objects, so that the adjacent aggregation objects do not need to be subjected to object aggregation.
In one embodiment of the present invention, feature matching may be performed by calculating the similarity of object features of adjacent aggregate objects, and adjacent aggregate objects having a similarity greater than a preset similarity may be considered as adjacent aggregate objects that are matched with each other.
For example, the preset similarity may be 70%, 80%, or the like.
In addition, except for the aggregate objects at the beginning and the end of the text, every aggregate object has two adjacent aggregate objects, located at its front end and its rear end respectively. When the similarities between an aggregate object and both of its adjacent aggregate objects are greater than the preset similarity, the aggregate object may be aggregated with the adjacent aggregate object of highest similarity to obtain the new aggregate object.
For example, if the preset similarity is 70%, the similarity between aggregate object 1 and aggregate object 2 is 80%, and the similarity between aggregate object 2 and aggregate object 3 is 75%, then object aggregation is performed on aggregate object 1 and aggregate object 2 to obtain the new aggregate object.
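The choice between two qualifying neighbours can be sketched as follows. The function name `pick_merge_partner` and the list layout (`pair_sims[k]` holds the similarity between objects k and k+1, zero-based) are assumptions for illustration; on a tie the sketch arbitrarily prefers the front neighbour, which the text does not specify.

```python
from typing import List, Optional

def pick_merge_partner(index: int, pair_sims: List[float],
                       threshold: float) -> Optional[int]:
    """Return the index of the neighbour to aggregate with, or None."""
    candidates = []
    if index > 0:                       # front-end neighbour exists
        candidates.append((pair_sims[index - 1], index - 1))
    if index < len(pair_sims):          # rear-end neighbour exists
        candidates.append((pair_sims[index], index + 1))
    candidates = [c for c in candidates if c[0] > threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]

# Example from the text: sim(1,2) = 80 %, sim(2,3) = 75 %, preset 70 %.
partner = pick_merge_partner(1, [0.80, 0.75], threshold=0.70)
print(partner)  # 0, i.e. aggregate object 2 is aggregated with aggregate object 1
```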
S103C: Obtain the object features of each new aggregate object according to the object features of the aggregate objects that were aggregated into it.
Specifically, the object features of the aggregated objects may be directly combined into the object features of the new aggregate object.
For example, the time-specific feature of aggregated object 1 is daytime, its place-specific feature is playground and classroom, and its text-character-specific feature is Zhang three and Liu four; the time-specific feature of aggregated object 2 is night, its place-specific feature is classroom and road, and its text-character-specific feature is Liu four and Wang five.
When aggregated object 1 and aggregated object 2 are aggregated into a new aggregate object, the time-specific feature of the new aggregate object is day and night, its place-specific feature is playground, classroom and road, and its text-character-specific feature is Zhang three, Liu four and Wang five.
In addition, in the case where the object features include the character-specific feature, the character-specific feature of the new aggregate object can be obtained through the above-described steps A to B.
Furthermore, the object features of the aggregated objects may be de-duplicated and then combined into the object features of the new aggregate object.
For example, if the time-specific feature of aggregated object 1 is day and night and the time-specific feature of aggregated object 2 is night, then after aggregated object 1 and aggregated object 2 are aggregated into a new aggregate object, the time-specific feature of the new aggregate object is day and night.
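One possible reading of the two combination modes is sketched below: direct combination keeps every element, including repeats, while de-duplicated combination keeps each element once. The function names `merge_direct` and `merge_dedup` are hypothetical, and representing a feature as a list of labels is an assumption.

```python
from typing import List

def merge_direct(a: List[str], b: List[str]) -> List[str]:
    """Concatenate the two features, keeping repeated elements."""
    return a + b

def merge_dedup(a: List[str], b: List[str]) -> List[str]:
    """Combine the two features and keep each element once, in first-seen order."""
    return list(dict.fromkeys(a + b))

time_1 = ["day", "night"]   # time-specific feature of aggregated object 1
time_2 = ["night"]          # time-specific feature of aggregated object 2
print(merge_direct(time_1, time_2))  # ['day', 'night', 'night']
print(merge_dedup(time_1, time_2))   # ['day', 'night']
```

Keeping repeats preserves element counts, which matters if a later similarity measure compares how often each element occurs.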
S103D: When the preset aggregation termination condition is met, take the aggregate objects as the unit clusters corresponding to the text.
Specifically, when the preset aggregation termination condition is met, the object aggregation process ends, and each new aggregate object is taken as a unit cluster corresponding to the text.
In contrast, in the case where the preset aggregation termination condition is not satisfied, it may be considered that the process of object aggregation is not yet completed, and at this time, each aggregation object cannot be used as the unit cluster corresponding to the text, and the process of object aggregation may be returned to the step S103A, until the preset aggregation termination condition is satisfied.
In one embodiment of the present invention, the preset aggregation termination condition may include at least one of the following:
(I) The number of aggregation rounds reaches a preset number of times.
Specifically, the preset number of times may be 20, 25, etc.
(II) The new aggregate objects are identical to the aggregate objects before aggregation.
For example, before aggregation, aggregate object 1 contains text unit 1 and text unit 2, and aggregate object 2 contains text unit 3 and text unit 4.
After aggregation, the new aggregate object 1 contains text unit 1 and text unit 2, and the new aggregate object 2 contains text unit 3 and text unit 4. All aggregate objects are identical to those before aggregation, so the preset aggregation termination condition is satisfied.
(III) The number of first target aggregate objects is greater than a preset number of objects.
A first target aggregate object is an aggregate object that contains more characters than a preset number.
Specifically, the preset number may be calculated by the following formula:
where thre_length is the preset number, length_script is the number of characters contained in the text, and n_thre is the expected number of scenario texts.
As can be seen from the above, feature matching is performed on each pair of adjacent aggregate objects, object aggregation is performed according to the matching results, the object features of the new aggregate objects are updated, and feature matching is performed again until a preset aggregation termination condition is met. Because feature matching and object aggregation are performed multiple times, the unit clusters determined for the text are more accurate.
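The loop formed by steps S103A to S103D can be sketched as follows, under several simplifying assumptions that are not part of the text: each object feature is a plain set of labels, similarity is the Jaccard index (swapped in for brevity), pairs are merged greedily from front to back, and only termination conditions (I) and (II) are checked. All names are hypothetical.

```python
from typing import List, Set, Tuple

AggObject = Tuple[List[int], Set[str]]  # (text unit indices, object feature)

def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def one_round(objs: List[AggObject], threshold: float) -> List[AggObject]:
    """S103A-S103C: match adjacent pairs, merge matches, merge their features."""
    merged: List[AggObject] = []
    i = 0
    while i < len(objs):
        if i + 1 < len(objs) and jaccard(objs[i][1], objs[i + 1][1]) > threshold:
            merged.append((objs[i][0] + objs[i + 1][0],
                           objs[i][1] | objs[i + 1][1]))
            i += 2
        else:
            merged.append(objs[i])
            i += 1
    return merged

def aggregate(unit_features: List[Set[str]], threshold: float = 0.3,
              max_rounds: int = 20) -> List[List[int]]:
    objs: List[AggObject] = [([i], set(f)) for i, f in enumerate(unit_features)]
    for _ in range(max_rounds):              # condition (I): round limit
        new_objs = one_round(objs, threshold)
        if len(new_objs) == len(objs):       # condition (II): nothing changed
            break
        objs = new_objs
    return [units for units, _ in objs]      # S103D: objects become clusters

scenes = [{"day", "playground"}, {"day", "playground", "classroom"},
          {"day", "classroom"}, {"night", "road"}]
clusters = aggregate(scenes)
print(clusters)  # [[0, 1, 2], [3]]
```

In this run, the first round merges units 0 and 1, the second round merges the result with unit 2, and the third round changes nothing, which ends the loop.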
Referring to fig. 3, a flow chart of a third scenario text determination method is provided, and the aforementioned step S103A may be implemented by the following steps S103A1-S103A2, compared to the aforementioned embodiment shown in fig. 2.
S103A1: for each type of object feature of the aggregate object, the similarity of each pair of adjacent aggregate objects on the object feature is calculated as the local similarity.
Since the aggregate object may have a plurality of different types of object features, each type of object feature reflects local information of a portion of the aggregate object, for each type of object feature, local similarity of pairs of adjacent aggregate objects may be calculated.
Specifically, for the time-specific feature and the place-specific feature, the local similarity can be calculated by the following formula.
Here, in the case where item_sim is calculated for the time-specific feature, x and y are the time-specific features of the two adjacent aggregate objects, and counter() counts the occurrences of each distinct element in a feature; for example, the time-specific feature x includes daytime twice and night once. counter(x) & counter(y) is the intersection of the element counts of feature x and feature y; for example, if the time-specific feature x includes daytime twice and night once, and the time-specific feature y includes daytime once and night twice, then counter(x) & counter(y) is daytime once and night once. len() calculates the number of elements contained in a feature, and mean(len(x), len(y)) calculates the average of len(x) and len(y).
In the case where item_sim is calculated for the place-specific feature, x and y are the place-specific features of the two adjacent aggregate objects, and counter() counts the occurrences of each distinct element in a feature; for example, the place-specific feature x includes playground twice and classroom once. counter(x) & counter(y) is the intersection of the element counts of feature x and feature y; for example, if the place-specific feature x includes playground twice and classroom once, and the place-specific feature y includes playground once and classroom twice, then counter(x) & counter(y) is playground once and classroom once. len() calculates the number of elements contained in a feature.
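The quantities described above map directly onto Python's `collections.Counter`. The formula itself is not reproduced in the text, so the sketch below assumes that the local similarity is the size of the count intersection divided by the mean feature length; the function name `local_similarity` is hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import List

def local_similarity(x: List[str], y: List[str]) -> float:
    """Assumed form: |counter(x) & counter(y)| / mean(len(x), len(y))."""
    common = Counter(x) & Counter(y)      # per-element minimum of the counts
    shared = sum(common.values())         # total number of shared elements
    denom = mean([len(x), len(y)])
    return shared / denom if denom else 0.0

x = ["day", "day", "night"]       # daytime twice, night once
y = ["day", "night", "night"]     # daytime once, night twice
print(Counter(x) & Counter(y))    # Counter({'day': 1, 'night': 1})
print(local_similarity(x, y))
```

With these inputs the intersection holds two elements and the mean length is 3, so the assumed similarity is 2/3.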
Specifically, for the character-specific feature, the local similarity may be calculated by cosine similarity calculation. In the case where text characters are included in the text content described by the text units, local similarity of features for the text characters may also be calculated by cosine similarity calculation.
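Cosine similarity between two character-specific features can be sketched as below, assuming each feature is a mapping from character string to weight. The function name `cosine_similarity` is hypothetical; the sample weights reuse the "running" and "walking" importance values given in the worked example later in the description.

```python
import math
from typing import Dict

def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                       # an empty feature matches nothing
    return dot / (norm_u * norm_v)

scene_1 = {"running": 0.7, "walking": 0.4}
scene_2 = {"running": 0.5, "walking": 0.6}
print(round(cosine_similarity(scene_1, scene_2), 3))  # 0.937
```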
S103A2: and obtaining the overall similarity of each pair of adjacent aggregate objects on all object characteristics according to the local similarity of each pair of adjacent aggregate objects.
Specifically, since each local similarity reflects the similarity of each pair of adjacent aggregate objects on each kind of object feature, the overall similarity of each pair of adjacent aggregate objects for all kinds of object features can be obtained by performing statistical calculation on each similarity.
The overall similarity may be obtained by performing weighted calculation or calculating an average value of the local similarities.
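Both options, a plain average and a weighted calculation, can be sketched with one function. The name `overall_similarity`, the feature keys and the weights are hypothetical; the text does not state which weights are used.

```python
from typing import Dict, Optional

def overall_similarity(local_sims: Dict[str, float],
                       weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of the local similarities; equal weights by default."""
    if weights is None:
        weights = {name: 1.0 for name in local_sims}
    total_weight = sum(weights[name] for name in local_sims)
    return sum(weights[name] * s for name, s in local_sims.items()) / total_weight

local = {"time": 1.0, "place": 0.5, "character": 0.6}
print(overall_similarity(local))                     # plain average
print(overall_similarity(local, {"time": 1.0, "place": 1.0, "character": 2.0}))
```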
In addition, an aggregate object containing many text units has object features containing much data, while an aggregate object containing few text units has object features containing little data, so the similarity between an aggregate object with many text units and one with few text units tends to be low. The overall similarity can therefore be obtained through the following steps C to D, so that aggregate objects with many text units and aggregate objects with few text units are more likely to undergo object aggregation.
Step C: and carrying out statistical calculation on the local similarity of each pair of adjacent aggregation objects to obtain the initial value of the overall similarity of each pair of adjacent aggregation objects.
Specifically, each local similarity may be weighted or an average value may be calculated, to obtain an initial value of the overall similarity.
Step D: and adjusting initial values of the overall similarity corresponding to each pair of adjacent aggregation objects according to the following expression to obtain the overall similarity of each pair of adjacent aggregation objects:
where W is the overall similarity, W_0 is the initial value of the overall similarity, and a, b, c and d are preset parameters. When calculating the similarity of the front-end aggregate object relative to the rear-end aggregate object of a pair of adjacent aggregate objects, set_size is the number of text units contained in the rear-end aggregate object; when calculating the similarity of the rear-end aggregate object relative to the front-end aggregate object, set_size is the number of text units contained in the front-end aggregate object.
For example, a may be 8, b may be 7, c may be 1, and d may be 1.5.
The more text units an aggregate object contains, the larger the value of the adjustment term and hence the larger the calculated overall similarity. A higher overall similarity is thus obtained when an aggregate object contains more text units, which makes object aggregation between an aggregate object with few text units and one with many text units easier.
From the above, since the characters in the aggregation objects included in the scenario text corresponding to the same scenario describe the same scenario together, the similarity between the object features of the aggregation objects included in the same scenario text is high, and feature matching can be performed by calculating the similarity of the object features of adjacent aggregation objects.
Referring to fig. 4, a flow chart of a fourth scenario text determination method is provided, and the aforementioned step S103B may be implemented by the following step S103B1, compared to the aforementioned embodiment shown in fig. 2.
S103B1: For each second target aggregate object, select the adjacent aggregate object with the highest overall similarity to it, and perform object aggregation on the second target aggregate object and the selected adjacent aggregate object to obtain a new aggregate object.
A second target aggregate object is an aggregate object that contains fewer text units than a preset number.
As can be seen from the above, a scenario text usually needs to reach a certain length before its characters can clearly describe the corresponding scenario, so a short aggregate object does not match how actual texts are written. Therefore, if an obtained aggregate object is determined to be short, it can be aggregated with an adjacent aggregate object into a new, longer aggregate object that is consistent with actual texts.
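Step S103B1 can be sketched as a function that merges the first short aggregate object into its more similar neighbour. The name `merge_first_short`, the list layout (`pair_sims[k]` is the overall similarity between objects k and k+1) and the tie rule (prefer the front neighbour) are assumptions; a caller would recompute similarities and repeat until no short object remains.

```python
from typing import List

def merge_first_short(objects: List[List[int]], pair_sims: List[float],
                      min_units: int) -> List[List[int]]:
    """Merge the first object with fewer than min_units text units into the
    adjacent object of highest overall similarity; no threshold is applied."""
    for i, obj in enumerate(objects):
        if len(obj) >= min_units:
            continue
        left = pair_sims[i - 1] if i > 0 else None
        right = pair_sims[i] if i < len(objects) - 1 else None
        if left is None and right is None:
            return objects                       # only one object in total
        if right is None or (left is not None and left >= right):
            j = i - 1                            # merge with front neighbour
        else:
            j = i                                # merge with rear neighbour
        return objects[:j] + [objects[j] + objects[j + 1]] + objects[j + 2:]
    return objects

objs = [[0, 1, 2], [3], [4, 5, 6]]
merged = merge_first_short(objs, pair_sims=[0.4, 0.6], min_units=2)
print(merged)  # [[0, 1, 2], [3, 4, 5, 6]]
```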
Referring to fig. 5, a flow chart of a fifth scenario text determination method is provided, and the aforementioned step S103A may be implemented by the following steps S103A3-S103A4, compared to the aforementioned embodiment shown in fig. 2.
S103A3: Determine the pairs of adjacent aggregate objects whose text units do not contain target characters.
Some characters in a text unit can indicate the relation between that text unit and its adjacent text units in terms of text content. For example, if a text unit contains the characters "three years later", the text content it describes is not continuous in time with that of the preceding adjacent text unit, and the two text units are not considered to belong to the same scenario. When feature matching is performed, these two text units therefore need not be matched against each other, and it is determined directly that they do not belong to the same scenario.
Based on the above idea, when feature matching is performed on aggregate objects, such characters indicating the relationship with adjacent text units can also be used to screen the aggregate objects, which improves the efficiency of feature matching between aggregate objects.
The target characters are characters indicating that a pair of adjacent aggregate objects cannot undergo object aggregation.
Specifically, the target characters may include forward target characters and backward target characters.
Forward target characters are characters indicating that the aggregate object containing them and the aggregate object adjacent to its front end do not belong to the same scenario text.
For example, if an aggregate object includes characters such as "three years later", the scenario described by the characters in this aggregate object differs greatly in time from that described by the aggregate object adjacent to its front end, and therefore the two do not belong to the same scenario text.
Backward target characters are characters indicating that the aggregate object containing them and the aggregate object adjacent to its rear end do not belong to the same scenario text.
For example, if the aggregate object includes characters such as "Zhang Sanzhu" and the like, the places of the episodes described by the characters in the aggregate object and the characters of the aggregate object adjacent to the rear end thereof are greatly different, and therefore the aggregate object and the aggregate object adjacent to the rear end thereof do not belong to the same episode text.
S103A4: and performing feature matching on the determined object features of each pair of adjacent aggregation objects.
As can be seen from the above, if a text unit in an aggregate object contains target characters, that aggregate object and the corresponding adjacent aggregate object belong to different scenario texts. Feature matching therefore only needs to be performed on pairs of adjacent aggregate objects that do not contain target characters, which reduces the number of aggregate objects undergoing feature matching and speeds up the determination of the scenario text.
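The screening in steps S103A3 to S103A4 can be sketched as a filter over adjacent pairs. The function name `matchable_pairs`, the sample sentences and the backward marker "left the city" are hypothetical; only "three years later" comes from the text.

```python
from typing import List, Sequence, Tuple

def matchable_pairs(texts: List[str], forward: Sequence[str],
                    backward: Sequence[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, i + 1) that still need feature matching."""
    pairs = []
    for i in range(len(texts) - 1):
        front, rear = texts[i], texts[i + 1]
        if any(marker in rear for marker in forward):
            continue    # rear object breaks with its front-end neighbour
        if any(marker in front for marker in backward):
            continue    # front object breaks with its rear-end neighbour
        pairs.append((i, i + 1))
    return pairs

texts = [
    "They met at school.",
    "Three years later, they met again.",
    "Then Zhang San left the city.",
    "A new town came into view.",
]
pairs = matchable_pairs(texts, forward=("Three years later",),
                        backward=("left the city",))
print(pairs)  # [(1, 2)]
```

Only one of the three adjacent pairs remains, so only that pair would be passed to feature matching.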
The scenario text determining method provided by the embodiment of the present invention is described below by way of a specific example with reference to fig. 6.
Referring to fig. 6, a flow diagram of a scenario text determination method is provided.
When the text is a script and the text content described by each scene of the script includes text characters, each text unit is one scene. The script contains n text units in total, and scene 1 to scene n are taken as aggregate object 1 to aggregate object n, that is, the initial values of aggregate objects 1 to n are scene 1 to scene n respectively.
The time information of each scene is identified to obtain the time-specific feature; for example, the time-specific feature of scene 1 is daytime, the time-specific feature of scene 2 is night, and so on;
The place information of each scene is identified to obtain the place-specific feature; for example, the place-specific feature of scene 1 is a playground and that of scene 2 is a classroom;
The text character information of each scene is identified to obtain the text-character-specific feature, which includes features of the protagonist, supporting characters and other characters; for example, in the text-character-specific feature of scene 1, Zhang three is the protagonist and Liu four is a supporting character, while in that of scene 2, Liu four is the protagonist and Zhang three is a supporting character, and so on;
The character-specific features of each scene are obtained through the tfidf model; for example, in scene 1 the character string "running" has an importance of 0.7 and occurs 5 times, and "walking" has an importance of 0.4 and occurs 3 times; in scene 2, "running" has an importance of 0.5 and occurs 7 times, and "walking" has an importance of 0.6 and occurs 6 times.
And respectively taking the content characteristics of the shots 1-n as initial values of the object characteristics of the aggregation objects 1-n, after obtaining the object characteristics of the aggregation objects, carrying out object aggregation on adjacent aggregation objects with similar object characteristics in the aggregation objects 1-n according to the characteristic matching results of the object characteristics, determining the adjacent aggregation objects as new aggregation objects, and determining the object characteristics of the new aggregation objects according to the object characteristics of the aggregated objects.
For example, if the object features of the aggregate object 1 and the aggregate object 2 in the adjacent aggregate objects match with each other, the aggregate object 1 and the aggregate object 2 are aggregated into a new aggregate object, and the new aggregate object may still be referred to as the aggregate object 1, and at this time, the aggregate object 1 includes the field 1 and the field 2, and the object feature of the new aggregate object 1 is determined according to the object feature of the original aggregate object 1 and the object feature of the original aggregate object 2.
If the object features of the aggregate object 4 and the aggregate object 5 in the adjacent aggregate objects match with each other, the aggregate object 4 and the aggregate object 5 are aggregated into a new aggregate object, which may still be referred to as the aggregate object 4, where the aggregate object 4 includes the field 4 and the field 5, and the object feature of the new aggregate object 4 is determined according to the object feature of the original aggregate object 4 and the object feature of the original aggregate object 5.
The aggregate object then includes: aggregate object 1, aggregate object 3, aggregate object 4, aggregate object 6 … … aggregate object n.
If the preset polymerization termination condition is not satisfied, the polymerization object needs to be subject to the polymerization again. That is, the above-described aggregate objects 1, 3, 4, and 6 … … are subject to aggregation until the aggregated aggregate objects satisfy the aggregation termination condition. The process of object aggregation for other aggregation objects is similar to the process of object aggregation for aggregation object 1-aggregation object n described above, and will not be described here.
If the preset aggregation termination condition is met at this time, the aggregation object 1, the aggregation object 3, the aggregation object 4 and the aggregation object 6 and … … are respectively determined as each unit cluster, and then characters in text units included in each unit cluster are determined as scenario texts of each scenario described by the texts.
For example, referring to FIG. 6, final scenario 1 and scenario 2 are determined as scenario text 1, and scenario n-1 and scenario n are determined as scenario text m.
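The repeated aggregation walked through above can be sketched as a small loop. This is an illustration under stated assumptions rather than the method of the embodiment: object features are modelled as plain sets, matching uses Jaccard similarity against a hypothetical `threshold`, the feature of a merged object is the union of the merged features, and the loop stops when a pass merges nothing or `max_rounds` is reached. All names are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two feature sets (an assumed matching measure)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def aggregate_once(objects, threshold):
    """One pass: merge each adjacent pair whose features match."""
    merged, i = [], 0
    while i < len(objects):
        cur = objects[i]
        if i + 1 < len(objects) and jaccard(cur[1], objects[i + 1][1]) >= threshold:
            nxt = objects[i + 1]
            # The new object keeps both scene lists; its feature is the union.
            merged.append((cur[0] + nxt[0], cur[1] | nxt[1]))
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged

def cluster_scenes(scene_features, threshold=0.5, max_rounds=10):
    """Return unit clusters as lists of 1-based scene numbers."""
    objects = [([k + 1], set(f)) for k, f in enumerate(scene_features)]
    for _ in range(max_rounds):
        new_objects = aggregate_once(objects, threshold)
        if len(new_objects) == len(objects):  # nothing merged, so stop
            break
        objects = new_objects
    return [units for units, _ in objects]

# Hypothetical scene features (time, place and role values mixed in one set).
scene_features = [
    {"day", "playground", "Zhang San"},
    {"day", "playground", "Zhang San", "Liu Si"},
    {"night", "classroom", "Liu Si"},
    {"night", "classroom", "Liu Si", "Zhang San"},
]
clusters = cluster_scenes(scene_features)
```

With these inputs the first pass merges scenes 1 and 2 and scenes 3 and 4; the second pass finds the two merged objects too dissimilar to merge, so the loop ends with two unit clusters.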
Corresponding to the scenario text determining method, the embodiment of the invention also provides a scenario text determining device.
Referring to fig. 7, an embodiment of the present invention provides a schematic structural diagram of a first scenario text determining apparatus, and specifically, the apparatus includes:
a unit determining module 701, configured to determine each text unit in a text, where each text unit comprises characters arranged consecutively in a portion of the text, and the text units do not overlap in position in the text;
a feature extraction module 702, configured to extract the content features of each text unit, where the content features are features reflecting the content described by the text unit;
a unit cluster determining module 703, configured to perform feature matching on the content features of text units that are adjacent in position in the text, determine the text units with similar content features, assign the text units with similar content features to the same unit cluster, and thereby determine each unit cluster corresponding to the text;
a scenario text determining module 704, configured to determine the characters in the text units contained in each unit cluster as the scenario text describing each scenario in the text.
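As an illustration of what the unit determining module produces, the sketch below splits a text into contiguous, non-overlapping units and records each unit's position. The blank-line delimiter, the function name `split_units` and the sample text are assumptions; the embodiment does not prescribe how unit boundaries are found.

```python
def split_units(text, sep="\n\n"):
    """Return (start, end) spans of contiguous, non-overlapping text units.

    Units are assumed to be separated by blank lines; text[start:end]
    yields the characters of one unit.
    """
    spans, pos = [], 0
    for part in text.split(sep):
        start, end = pos, pos + len(part)
        if part.strip():  # skip empty units
            spans.append((start, end))
        pos = end + len(sep)
    return spans

# Hypothetical two-scene text.
text = "Scene 1. Day.\nZhang San runs.\n\nScene 2. Night.\nLiu Si reads."
spans = split_units(text)
units = [text[s:e] for s, e in spans]
```

Because every span starts after the previous one ends, the units have no intersection in position, which is the property the module requires.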
As can be seen from the above, when determining scenario texts, the embodiment of the present invention performs feature matching on the content features of the text units. Text units within the same scenario text are similar, that is, their matching degree is high, whereas text units in different scenario texts are dissimilar, that is, their matching degree is low. Adjacent and similar text units are therefore determined as the same unit cluster, so that the characters in the text units contained in each unit cluster form the scenario text describing one scenario in the text. Each scenario text in the text can thus be determined through the scheme provided by the embodiment of the present invention. Compared with the prior art, no manual participation is needed when this scheme is applied, so the efficiency of determining the scenario texts in a text can be improved.
Referring to fig. 8, a schematic structural diagram of a second scenario text determining apparatus is provided, and the above-mentioned unit cluster determining module 703 includes:
the feature matching submodule 703A is configured to perform feature matching on the object features of each pair of adjacent aggregate objects, where an aggregate object is the operation unit on which object aggregation is performed; each initial aggregate object contains one text unit, with text units and initial objects in one-to-one correspondence; the initial value of the object feature of each aggregate object is the content feature of the text unit it contains; and the text units contained in adjacent aggregate objects are adjacent in position in the text;
the object aggregation submodule 703B is configured to perform object aggregation on each pair of adjacent aggregation objects according to the matching result, so as to obtain a new aggregation object;
a feature obtaining submodule 703C, configured to obtain an object feature of each new aggregate object according to the object feature of the aggregated object of each new aggregate object;
the unit cluster determining submodule 703D is configured to, when a preset aggregation termination condition is met, respectively use each aggregation object as each unit cluster corresponding to the text.
In one embodiment of the present invention, the preset aggregation termination condition includes at least one of the following:
the number of aggregation rounds reaches a preset number;
the new aggregate objects are the same as the aggregate objects before aggregation;
the number of first target aggregate objects is greater than a preset object count, where a first target aggregate object is an aggregate object containing more than a preset number of characters.
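The three termination conditions listed above can be checked with a small predicate. This is a minimal sketch: aggregate objects are represented as lists of text-unit strings, and the parameter names and default values (`max_rounds`, `min_chars`, `min_long_objects`) are hypothetical stand-ins for the preset values, which the text does not specify.

```python
def should_stop(rounds, prev_objects, new_objects,
                max_rounds=10, min_chars=200, min_long_objects=5):
    """Return True if any of the three termination conditions holds.

    Each aggregate object is a list of text-unit strings.
    """
    # 1) The number of aggregation rounds has reached the preset number.
    if rounds >= max_rounds:
        return True
    # 2) The last round changed nothing.
    if new_objects == prev_objects:
        return True
    # 3) Enough "first target" objects exist, i.e. objects that contain
    #    more than min_chars characters.
    long_objects = sum(
        1 for obj in new_objects if sum(len(u) for u in obj) > min_chars
    )
    return long_objects > min_long_objects
```

Because the embodiment requires only "at least one of" the conditions, the predicate returns as soon as any single check succeeds.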
As can be seen from the above, feature matching is performed on each pair of adjacent aggregate objects, the pairs are aggregated according to the matching result, the object features of the new aggregate objects are updated, and feature matching is performed again until a preset aggregation termination condition is met. Because feature matching and object aggregation are performed multiple times, the unit clusters determined for the text are more accurate.
Referring to fig. 9, a schematic structural diagram of a third scenario text determining apparatus is provided, and the feature matching sub-module 703A includes:
a similarity calculation unit 703A1, configured to calculate, for each type of object feature of the aggregate object, a similarity of each pair of adjacent aggregate objects on the object feature, as a local similarity;
a similarity obtaining unit 703A2, configured to obtain, according to the local similarity of each pair of adjacent aggregate objects, the overall similarity of each pair of adjacent aggregate objects on all object features.
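The division of labour between the two units can be illustrated as follows. The sketch assumes that each object feature is a set of values keyed by feature type, that the local similarity is a Jaccard similarity, and that the "statistical calculation" combining the local similarities is an arithmetic mean; none of these choices is fixed by the text, and the function names are hypothetical. The subsequent adjustment of the initial value is not sketched.

```python
def local_similarity(a, b):
    """Jaccard similarity of two value sets for one feature type."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def overall_similarity(obj_a, obj_b):
    """Return (mean of the local similarities, the local similarities).

    obj_a and obj_b map a feature type to a set of values. The mean is
    an assumed statistic for the initial overall similarity.
    """
    types = sorted(set(obj_a) | set(obj_b))
    if not types:
        return 1.0, {}
    local = {
        t: local_similarity(obj_a.get(t, set()), obj_b.get(t, set()))
        for t in types
    }
    return sum(local.values()) / len(local), local

# Hypothetical features of two adjacent aggregate objects.
a = {"time": {"day"}, "place": {"playground"}, "role": {"Zhang San", "Liu Si"}}
b = {"time": {"day"}, "place": {"classroom"}, "role": {"Liu Si"}}
w0, local = overall_similarity(a, b)
```

Here the two objects agree on time, disagree on place and share one of two roles, so the local similarities are 1.0, 0.0 and 0.5 and their mean is 0.5.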
In one embodiment of the present invention, the similarity obtaining unit 703A2 is specifically configured to:
carrying out statistical calculation on the local similarity of each pair of adjacent aggregation objects to obtain an initial value of the overall similarity of each pair of adjacent aggregation objects;
and adjusting initial values of the overall similarity corresponding to each pair of adjacent aggregation objects according to the following expression to obtain the overall similarity of each pair of adjacent aggregation objects:
wherein W is the overall similarity, W_0 is the initial value of the overall similarity, and a, b, c and d are preset parameters; in the case of calculating the similarity of the front aggregate object relative to the back aggregate object in a pair of adjacent aggregate objects, set_size is the number of text units contained in the back aggregate object, and in the case of calculating the similarity of the back aggregate object relative to the front aggregate object, set_size is the number of text units contained in the front aggregate object.
From the above, since the characters in the aggregation objects included in the scenario text corresponding to the same scenario describe the same scenario together, the similarity between the object features of the aggregation objects included in the same scenario text is high, and feature matching can be performed by calculating the similarity of the object features of adjacent aggregation objects.
In one embodiment of the present invention, the object aggregation submodule 703B is specifically configured to:
for each second target aggregate object, selecting the adjacent aggregate object that has the highest overall similarity with the second target aggregate object, and performing object aggregation on the second target aggregate object and the selected adjacent aggregate object to obtain a new aggregate object, where a second target aggregate object is an aggregate object containing fewer than a preset number of text units.
As can be seen from the above, a scenario text usually has to reach a certain length before its characters can clearly describe the corresponding scenario, so a very short aggregate object does not match the way actual texts are written. Therefore, if an obtained aggregate object is found to be short, it can be aggregated with an adjacent aggregate object to form a new, longer aggregate object that is consistent with actual texts.
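The handling of short aggregate objects can be sketched as below. The code is an illustration only: aggregate objects are lists of text-unit identifiers in text order, `min_units` stands in for the preset number, and the overall similarity is passed in as a caller-supplied function `sim`, since its exact form is not reproduced here. All names are hypothetical.

```python
def merge_short(objects, sim, min_units=2):
    """Merge each object with fewer than min_units units into the
    adjacent object it is most similar to.

    objects: list of lists of text-unit ids, in text order.
    sim(a, b): overall similarity of two objects, supplied by the caller.
    """
    objects = [list(o) for o in objects]
    i = 0
    while i < len(objects) and len(objects) > 1:
        if len(objects[i]) >= min_units:
            i += 1
            continue
        has_left = i > 0
        has_right = i + 1 < len(objects)
        go_left = has_left and (
            not has_right
            or sim(objects[i], objects[i - 1]) >= sim(objects[i], objects[i + 1])
        )
        if go_left:
            objects[i - 1:i + 1] = [objects[i - 1] + objects[i]]
            i -= 1  # re-check the merged object
        else:
            objects[i:i + 2] = [objects[i] + objects[i + 1]]
    return objects

# Hypothetical labels standing in for object features.
labels = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

def sim(a, b):
    sa, sb = {labels[u] for u in a}, {labels[u] for u in b}
    return len(sa & sb) / len(sa | sb)

result = merge_short([[1, 2], [3], [4, 5]], sim)
```

In the example the single-unit object [3] is more similar to its right-hand neighbour, so it is merged to the right rather than to the left.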
In one embodiment of the present invention, the feature matching sub-module 703A is specifically configured to:
determining, among the aggregate objects, the adjacent aggregate objects whose text units do not contain a target character, where a target character is a character indicating that a pair of adjacent aggregate objects cannot be aggregated;
and performing feature matching on the object features of each pair of adjacent aggregate objects so determined.
As can be seen from the above, if a text unit in an aggregate object contains the target character, that text unit and the adjacent aggregate object belong to different scenario texts. Feature matching therefore only needs to be performed on adjacent aggregate objects that do not contain the target character, which reduces the number of aggregate objects to be matched and speeds up the determination of the scenario texts.
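The filtering step can be pictured with the following sketch. It makes two assumptions that the text does not state: the target character is the marker "#", and a unit containing it starts a new scenario, so it is never matched with the unit before it. The function name `matchable_pairs` is hypothetical.

```python
def matchable_pairs(units, target_chars=("#",)):
    """Return index pairs (i, i + 1) of adjacent units eligible for matching.

    Assumption: a unit containing a target character starts a new
    scenario, so the pair formed with the preceding unit is skipped.
    """
    pairs = []
    for i in range(len(units) - 1):
        if any(ch in units[i + 1] for ch in target_chars):
            continue
        pairs.append((i, i + 1))
    return pairs

# Hypothetical units; the third one carries the marker.
pairs = matchable_pairs(["a", "b", "# c", "d"])
```

Only the pairs that survive this filter are passed on to feature matching, which is where the reduction in work comes from.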
In one embodiment of the invention, the content features include at least one of the following features: features for time, features for place, features for text character, features for character.
In one embodiment of the present invention, the content features include the feature for character, and in the case that the content described by the text units involves text characters, the feature extraction module 702 is specifically configured to:
identifying the names of the text characters in each text unit;
for each text unit, extracting the feature for character from the remaining characters, the remaining characters being the characters in the text unit other than the names of the text characters.
As can be seen from the above, when the text is a novel, a script or the like, the names of the text characters are likely to be uncommon names. If word segmentation were performed directly on each text unit, the resulting character strings would be distorted by these names, because some strings produced by segmentation would contain fragments of them. Removing the names first and then extracting the feature for character therefore improves the accuracy of the obtained feature.
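The name-removal step can be illustrated as follows. The sketch assumes the names have already been recognized and are passed in as a list, and it uses a simple regular-expression tokenizer as a stand-in for real word segmentation; both the function names and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def strip_names(unit, names):
    """Replace every recognized name in the unit with a space."""
    # Longest names first, so a longer name is not split by a shorter one.
    pattern = "|".join(
        re.escape(n) for n in sorted(names, key=len, reverse=True)
    )
    return re.sub(pattern, " ", unit) if pattern else unit

def remaining_token_counts(unit, names):
    """Count tokens of the unit after the names have been removed."""
    tokens = re.findall(r"\w+", strip_names(unit, names).lower())
    return Counter(tokens)

counts = remaining_token_counts(
    "Zhang San runs. Liu Si runs too.", ["Zhang San", "Liu Si"]
)
```

The resulting counts contain only the non-name strings, which can then be fed to a weighting scheme such as TF-IDF without the names skewing the result.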
The embodiment of the invention also provides an electronic device, as shown in fig. 10, which comprises a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, wherein the processor 1001, the communication interface 1002 and the memory 1003 complete communication with each other through the communication bus 1004,
a memory 1003 for storing a computer program;
the processor 1001 is configured to implement the method steps described in any of the scenario text determining method embodiments when executing the program stored in the memory 1003.
When the electronic device provided by the embodiment of the present invention is used to determine the scenario texts in a text, feature matching is performed on the content features of the text units. Text units within the same scenario text are similar, that is, their matching degree is high, whereas text units in different scenario texts are dissimilar, that is, their matching degree is low. Adjacent and similar text units are therefore determined as the same unit cluster, so that the characters in the text units contained in each unit cluster form the scenario text describing one scenario in the text. Compared with the prior art, no manual participation is needed when the scheme provided by the embodiment of the present invention is applied, so the efficiency of determining the scenario texts in a text can be improved.
The communication bus of the electronic device mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is drawn as a single bold line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In a further embodiment of the present invention, a computer readable storage medium is also provided, in which a computer program is stored, which computer program, when being executed by a processor, implements the method steps of any of the scenario text determination method embodiments described above.
When the computer program stored in the computer-readable storage medium provided by this embodiment is executed to determine the scenario texts in a text, feature matching is performed on the content features of the text units. Text units within the same scenario text are similar, that is, their matching degree is high, whereas text units in different scenario texts are dissimilar, that is, their matching degree is low. Adjacent and similar text units are therefore determined as the same unit cluster, so that the characters in the text units contained in each unit cluster form the scenario text describing one scenario in the text. Each scenario text in the text can thus be determined through the scheme provided by the embodiment of the present invention. Compared with the prior art, no manual participation is needed when this scheme is applied, so the efficiency of determining the scenario texts in a text can be improved.
In a further embodiment of the present invention, a computer program product comprising instructions is also provided, which when run on a computer, causes the computer to perform the method steps of any of the scenario text determination method embodiments described in the previous embodiments.
When the computer program product provided by this embodiment is executed to determine the scenario texts in a text, feature matching is performed on the content features of the text units. Text units within the same scenario text are similar, that is, their matching degree is high, whereas text units in different scenario texts are dissimilar, that is, their matching degree is low. Adjacent and similar text units are therefore determined as the same unit cluster, so that the characters in the text units contained in each unit cluster form the scenario text describing one scenario in the text. Each scenario text in the text can thus be determined through the scheme provided by the embodiment of the present invention. Compared with the prior art, no manual participation is needed when this scheme is applied, so the efficiency of determining the scenario texts in a text can be improved.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, the electronic device, the computer-readable storage medium and the computer program product are described relatively briefly because they are substantially similar to the method embodiments; for relevant points, refer to the corresponding description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (8)

1. A scenario text determining method, the method comprising:
determining each text unit in a text, wherein each text unit comprises characters arranged consecutively in a portion of the text, and the text units do not overlap in position in the text;
extracting content characteristics of each text unit, wherein the content characteristics are as follows: a feature reflecting content described by a text unit, the content feature comprising at least one of: a feature for time, a feature for place, a feature for text character, a feature for weather, a feature for item, a feature for character;
for each type of object feature of the aggregate objects, calculating the similarity of each pair of adjacent aggregate objects on that object feature as a local similarity, wherein an aggregate object is the operation unit on which object aggregation is performed, each initial aggregate object contains one text unit, text units and initial objects are in one-to-one correspondence, the initial value of the object feature of each aggregate object is the content feature of the text unit it contains, and the text units contained in adjacent aggregate objects are adjacent in position in the text;
performing statistical calculation on the local similarities of each pair of adjacent aggregate objects to obtain an initial value of the overall similarity of each pair of adjacent aggregate objects;
and adjusting the initial value of the overall similarity corresponding to each pair of adjacent aggregate objects according to the following expression to obtain the overall similarity of each pair of adjacent aggregate objects:
wherein W is the overall similarity, W_0 is the initial value of the overall similarity, and a, b, c and d are preset parameters; in the case of calculating the similarity of the front aggregate object relative to the back aggregate object in a pair of adjacent aggregate objects, set_size is the number of text units contained in the back aggregate object, and in the case of calculating the similarity of the back aggregate object relative to the front aggregate object, set_size is the number of text units contained in the front aggregate object;
performing object aggregation on each pair of adjacent aggregation objects according to the matching result to obtain new aggregation objects;
obtaining the object characteristics of each new aggregation object according to the object characteristics of the aggregated object of each new aggregation object;
under the condition that a preset aggregation termination condition is met, taking each aggregation object as each unit cluster corresponding to the text;
Characters in text units included in each unit cluster are respectively determined as scenario text for describing each scenario in the text.
2. The method of claim 1, wherein the preset aggregation termination condition includes at least one of:
the number of aggregation rounds reaches a preset number;
the new aggregate objects are the same as the aggregate objects before aggregation;
the number of first target aggregate objects is greater than a preset object count, wherein a first target aggregate object is an aggregate object containing more than a preset number of characters.
3. The method according to claim 1 or 2, wherein the performing object aggregation on each pair of adjacent aggregate objects according to the matching result to obtain new aggregate objects includes:
for each second target aggregate object, selecting the adjacent aggregate object that has the highest overall similarity with the second target aggregate object, and performing object aggregation on the second target aggregate object and the selected adjacent aggregate object to obtain a new aggregate object, wherein a second target aggregate object is an aggregate object containing fewer than a preset number of text units.
4. The method according to claim 1 or 2, wherein the feature matching of object features of each pair of adjacent aggregate objects in the aggregate objects comprises:
determining, among the aggregate objects, the adjacent aggregate objects whose text units do not contain a target character, wherein a target character is a character indicating that a pair of adjacent aggregate objects cannot be aggregated;
and performing feature matching on the determined object features of each pair of adjacent aggregation objects.
5. The method according to claim 1, wherein the content features include the feature for character, and wherein, in the case that the content described by the text units involves text characters, the extracting the content features of the respective text units includes:
identifying the names of the text characters in each text unit;
for each text unit, extracting the feature for character from the remaining characters, the remaining characters being the characters in the text unit other than the names of the text characters.
6. A scenario text determining apparatus, the apparatus comprising:
a unit determining module, configured to determine each text unit in a text, wherein each text unit comprises characters arranged consecutively in a portion of the text, and the text units do not overlap in position in the text;
the feature extraction module is used for extracting the content features of each text unit, wherein the content features are as follows: a feature reflecting content described by a text unit, the content feature comprising at least one of: a feature for time, a feature for place, a feature for text character, a feature for weather, a feature for item, a feature for character;
a unit cluster determining module, configured to: for each type of object feature of the aggregate objects, calculate the similarity of each pair of adjacent aggregate objects on that object feature as a local similarity, wherein an aggregate object is the operation unit on which object aggregation is performed, each initial aggregate object contains one text unit, text units and initial objects are in one-to-one correspondence, the initial value of the object feature of each aggregate object is the content feature of the text unit it contains, and the text units contained in adjacent aggregate objects are adjacent in position in the text; perform statistical calculation on the local similarities of each pair of adjacent aggregate objects to obtain an initial value of the overall similarity of each pair of adjacent aggregate objects; and adjust the initial value of the overall similarity corresponding to each pair of adjacent aggregate objects according to the following expression to obtain the overall similarity of each pair of adjacent aggregate objects:
wherein W is the overall similarity, W_0 is the initial value of the overall similarity, and a, b, c and d are preset parameters; in the case of calculating the similarity of the front aggregate object relative to the back aggregate object in a pair of adjacent aggregate objects, set_size is the number of text units contained in the back aggregate object, and in the case of calculating the similarity of the back aggregate object relative to the front aggregate object, set_size is the number of text units contained in the front aggregate object; perform object aggregation on each pair of adjacent aggregate objects according to the matching result to obtain new aggregate objects; obtain the object feature of each new aggregate object according to the object features of the objects aggregated into it; and, in the case that a preset aggregation termination condition is met, take each aggregate object as a unit cluster corresponding to the text;
and a scenario text determining module, configured to determine the characters in the text units contained in each unit cluster as the scenario text describing each scenario in the text.
7. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-5 when executing a program stored on a memory.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-5.
CN202010724600.8A 2020-07-24 2020-07-24 Method and device for determining scenario text Active CN111859894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010724600.8A CN111859894B (en) 2020-07-24 2020-07-24 Method and device for determining scenario text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010724600.8A CN111859894B (en) 2020-07-24 2020-07-24 Method and device for determining scenario text

Publications (2)

Publication Number Publication Date
CN111859894A CN111859894A (en) 2020-10-30
CN111859894B true CN111859894B (en) 2024-01-23

Family

ID=72949488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010724600.8A Active CN111859894B (en) 2020-07-24 2020-07-24 Method and device for determining scenario text

Country Status (1)

Country Link
CN (1) CN111859894B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002041544A (en) * 2000-07-25 2002-02-08 Toshiba Corp Text information analyzing device
JP2011215899A (en) * 2010-03-31 2011-10-27 Kddi Corp Similar document retrieval device
CN103136359A (en) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Generation method of single document summaries
CN108804563A (en) * 2018-05-22 2018-11-13 阿里巴巴集团控股有限公司 A kind of data mask method, device and equipment
CN109739975A (en) * 2018-11-15 2019-05-10 东软集团股份有限公司 Focus incident abstracting method, device, readable storage medium storing program for executing and electronic equipment
JP2020052961A (en) * 2018-09-28 2020-04-02 キヤノン株式会社 Content providing method, content providing system, information processing device, and program
CN111401031A (en) * 2020-03-05 2020-07-10 支付宝(杭州)信息技术有限公司 Target text determination method, device and equipment
CN111414479A (en) * 2020-03-16 2020-07-14 北京智齿博创科技有限公司 Label extraction method based on short text clustering technology

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9213687B2 (en) * 2009-03-23 2015-12-15 Lawrence Au Compassion, variety and cohesion for methods of text analytics, writing, search, user interfaces
US10073835B2 (en) * 2013-12-03 2018-09-11 International Business Machines Corporation Detecting literary elements in literature and their importance through semantic analysis and literary correlation
US10467276B2 (en) * 2016-01-28 2019-11-05 Ceeq It Corporation Systems and methods for merging electronic data collections
CN106910501B (en) * 2017-02-27 2019-03-01 腾讯科技(深圳)有限公司 Text entities extracting method and device

Also Published As

Publication number Publication date
CN111859894A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN108073568B (en) Keyword extraction method and device
CN106156204B (en) Text label extraction method and device
US10423664B2 (en) Method and system for providing recommended terms
CN105022754B (en) Object classification method and device based on social network
US20170330054A1 (en) Method And Apparatus Of Establishing Image Search Relevance Prediction Model, And Image Search Method And Apparatus
CN105653705B (en) Hot event searching method and device
CN111814770B (en) Content keyword extraction method of news video, terminal device and medium
US8566303B2 (en) Determining word information entropies
CN105045875B (en) Personalized search and device
US20110225161A1 (en) Categorizing products
CN106919575B (en) Application program searching method and device
CN109508378B (en) Sample data processing method and device
CN110019794B (en) Text resource classification method and device, storage medium and electronic device
WO2022095374A1 (en) Keyword extraction method and apparatus, and terminal device and storage medium
CN107526846B (en) Method, device, server and medium for generating and sorting channel sorting model
US20180210897A1 (en) Model generation method, word weighting method, device, apparatus, and computer storage medium
CN111708909B (en) Video tag adding method and device, electronic equipment and computer readable storage medium
CN110210028A (en) For domain feature words extracting method, device, equipment and the medium of speech translation text
WO2022007626A1 (en) Video content recommendation method and apparatus, and computer device
CN111914564B (en) Text keyword determination method and device
CN107908649B (en) Text classification control method
CN110019556B (en) Topic news acquisition method, device and equipment thereof
CN111859894B (en) Method and device for determining scenario text
CN112613296A (en) News importance degree acquisition method and device, terminal equipment and storage medium
CN115827990B (en) Searching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant