CN111291535A - Script processing method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN111291535A
CN111291535A
Authority
CN
China
Prior art keywords
scene
text
information
preset
episode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010136869.4A
Other languages
Chinese (zh)
Other versions
CN111291535B (en)
Inventor
郏昕
阳任科
赵冲翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010136869.4A
Publication of CN111291535A
Application granted
Publication of CN111291535B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a script processing method and device, an electronic device and a computer-readable storage medium, belonging to the field of computer technology. In the method, a script to be processed is divided into a plurality of episodes according to a preset episode-number expression range; each episode is divided into a plurality of scene texts according to a preset scene-number expression range; the scene information characters contained in each scene text are extracted; the scene information characters, the scene number of the scene text, and the episode number of the episode to which the scene text belongs are determined as the information to be sorted for that scene text; and the information to be sorted is combined with the body text of the scene text in a preset form. Because extraction takes a single scene text as its processing object, the degree of coupling within the script can be reduced to a certain extent and extraction accuracy improved. Recombining the scene texts in the preset form keeps their internal form consistent, which further facilitates processing.

Description

Script processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a scenario processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In application scenarios such as standardized management, overall shooting management, and intelligent script evaluation, analysis of the information to be sorted in scripts is often involved. The information to be sorted refers to scene information characters such as the episode number, scene number, time, location, and characters, and these are often scattered throughout the script content. In the prior art, several fixed scene-information format templates are usually predefined, and the information to be sorted is extracted directly from the script using these fixed templates.
Because screenwriters have different writing habits, the text structure varies greatly from script to script. When a script's format differs substantially from the format in the fixed template, the accuracy of the information to be sorted extracted according to the fixed-format template is low.
Disclosure of Invention
The invention provides a script processing method, a script processing device, electronic equipment and a computer readable storage medium, which are used for solving the problem of low accuracy of extracted information to be sorted.
In a first aspect of the present invention, there is provided a scenario processing method applied to an electronic device, the method including:
determining a set number and a position of the set number contained in the script to be processed according to a preset set number expression range, and dividing the script to be processed into a plurality of scripts according to the contained set number and the position of the set number;
for at least one episode, determining a scene number and a position of the scene number contained in the episode according to a preset scene number expression range, and dividing the episode into a plurality of scene texts according to the scene number and the position of the scene number;
for at least one scene text, extracting scene information characters contained in the scene text;
determining scene information characters contained in the scene text, a scene number of the scene text and a collection number of an episode to which the scene text belongs as information to be sorted of the scene text;
and combining the information to be sorted of the scene text and the text in the scene text according to a preset form to form a target script.
In a second aspect of the present invention, there is also provided a scenario processing apparatus applied to an electronic device, the apparatus including:
the first determining module is used for determining the episode number and the position of the episode number contained in the scenario to be processed according to a preset episode number expression range, and dividing the scenario to be processed into a plurality of episodes according to the contained episode number and the position of the episode number;
the second determining module is used for determining scene numbers and positions of the scene numbers contained in at least one episode according to a preset scene number expression range, and dividing the episode into a plurality of scene texts according to the scene numbers and the positions of the scene numbers;
the extraction module is used for extracting scene information characters contained in the scene text for at least one scene text;
a third determining module, configured to determine, as information to be sorted of the scene text, a scene information character included in the scene text, a scene number of the scene text, and a collection number of an episode to which the scene text belongs;
and the combination module is used for combining the information to be sorted of the scene text and the text in the scene text according to a preset form to form the target script.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute any of the scenario processing methods described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the scenario processing methods described above.
The script processing method provided by the embodiment of the invention determines the episode numbers and their positions contained in the script to be processed according to a preset episode-number expression range, and divides the script to be processed into a plurality of episodes accordingly; determines the scene numbers and their positions contained in each episode according to a preset scene-number expression range, and divides the episode into a plurality of scene texts accordingly; extracts, for at least one scene text, the scene information characters it contains; determines the scene information characters, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the information to be sorted for that scene text; and combines the information to be sorted with the body text of the scene text in a preset form to produce the target script. In the embodiment of the invention, the script to be processed is divided into scene texts and extraction takes a single scene text as its processing object, so the degree of coupling within the script can be reduced to a certain extent, the interference of the script format with scene-information extraction is lessened, and extraction accuracy is improved. Meanwhile, after the scene information is extracted, the scene texts are recombined in the preset form, so the internal form of each scene text in the script is kept consistent, which facilitates subsequent processing.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart illustrating steps of a scenario processing method according to an embodiment of the present invention;
FIG. 2-1 is a flow chart of steps of another scenario processing method provided by an embodiment of the present invention;
FIG. 2-2 is a schematic diagram of a pretreatment provided by an embodiment of the present invention;
FIGS. 2-3 are schematic process flow diagrams provided by embodiments of the present invention;
FIGS. 2-4 are schematic diagrams of a process provided by an embodiment of the present invention;
fig. 2-5 are schematic diagrams illustrating a scene text according to an embodiment of the present invention;
fig. 3 is a block diagram of a scenario processing apparatus according to an embodiment of the present invention;
fig. 4 is a structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
Fig. 1 is a flowchart of steps of a scenario processing method provided by an embodiment of the present invention, where the method may be applied to an electronic device, and as shown in fig. 1, the method may include:
step 101, determining a set number and a position of the set number included in the scenario to be processed according to a preset set number expression range, and dividing the scenario to be processed into a plurality of scenarios according to the included set number and the position of the set number.
In this embodiment of the present invention, the preset episode-number expression range may be obtained in advance, before this step is performed. The expression range may contain episode numbers in a plurality of expression forms. Specifically, to obtain these forms, a large number of sample scripts may be collected, for example by crawling them from the network. The characters representing episode numbers are then extracted from the sample scripts and summarized by expression form, yielding episode numbers in a variety of expression forms. An episode number may be expressed with Chinese characters, Arabic numerals, English words, Roman numerals, and so on.
Further, in an actual application, a script contains multiple episodes, and each episode contains different scenes; each scene corresponds to a segment of scene text that depicts the content of that scene. Each episode and each scene text can carry a corresponding number to distinguish it. Therefore, in this step, the episode numbers and their positions contained in the script to be processed may be determined based on the preset episode-number expression range. Specifically, the episode numbers included in the expression range may be matched against the content of the script to be processed; a matched character string is determined to be an episode number, and its location is determined to be the position of that episode number. The episodes contained in the script can then be obtained by dividing it at the identified episode numbers and positions. Because episode numbers in many different expression forms are collected in advance, matching against the preset episode numbers can, to a certain extent, avoid the problem of episode numbers going unrecognized due to screenwriters' differing writing habits, and can improve the accuracy of episode division.
And 102, for at least one episode, determining a scene number and a position of the scene number contained in the episode according to a preset scene number expression range, and dividing the episode into a plurality of scene texts according to the scene number and the position of the scene number.
In this embodiment of the present invention, the preset scene-number expression range may be obtained in advance, before this step is executed. The expression range may contain scene numbers in a plurality of expression forms. Specifically, to obtain these forms, a large number of sample scripts may be collected; the characters representing scene numbers are then extracted and summarized by expression form, yielding scene numbers in different expression forms. A scene number may be expressed with Chinese characters, Arabic numerals, English words, Roman numerals, and so on.
Further, when determining the scene numbers and positions contained in the episode based on the preset scene-number expression range, the scene numbers included in the expression range may be matched against the content of the episode; a matched character string is determined to be a scene number, and its location is determined to be the position of that scene number. The episode can then be segmented at the identified scene numbers and positions to obtain the scene texts it contains. Because scene numbers in different expression forms are collected in advance, matching against the preset scene numbers can, to a certain extent, avoid the problem of scene numbers going unrecognized due to screenwriters' differing writing habits, and can improve the accuracy of scene-text division.
Step 103, extracting scene information characters contained in the scene text for at least one scene text.
In the embodiment of the present invention, the scene information characters contained in the scene text may be Chinese characters, words, numbers, English characters, and the like that represent scene information. By extracting the scene information characters contained in the scene text, the scene information of that scene text can be determined.
And step 104, determining scene information characters contained in the scene text, the scene number of the scene text and the collection number of the episode to which the scene text belongs as the information to be sorted of the scene text.
For example, assume the scene information characters contained in a scene text are "raining, school playground, all teachers", the scene number of the scene text is 23, and the episode number of the episode to which it belongs is 1. Then "1, 23, raining, school playground, all teachers" may be determined as the information to be sorted for that scene text.
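As a minimal sketch of this step (the comma-joined string form and the field order are assumptions based on the worked example), the assembly of the information to be sorted might look like:

```python
def build_info_to_sort(episode_no, scene_no, scene_info_chars):
    # Field order follows the example above: episode number, scene
    # number, then the extracted scene information characters.
    return ", ".join([str(episode_no), str(scene_no)] + list(scene_info_chars))

# Episode 1, scene 23, with three extracted scene-info items.
info = build_info_to_sort(1, 23, ["raining", "school playground", "all teachers"])
print(info)  # → 1, 23, raining, school playground, all teachers
```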
And 105, combining the information to be sorted of the scene text and the text in the scene text according to a preset form to form a target script.
In the embodiment of the invention, the body text refers to the text in the scene text other than the information to be sorted. The preset form can be set according to actual requirements; for example, the scene information may be placed in the first segment of the scene text, with the body text following it. Recombining the scene texts in this preset form keeps the internal form of each scene text in the script consistent, which facilitates subsequent processing.
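The recombination in the preset form (scene information as the first segment, body text after it) can be sketched as follows; the newline separator is an assumption:

```python
def combine_scene(info_to_sort, body_text):
    # Preset form: the information to be sorted occupies the first
    # segment, and the body text follows on subsequent lines.
    return info_to_sort + "\n" + body_text

scene = combine_scene("1, 23, raining, school playground, all teachers",
                      "Teacher Wang hurries across the wet playground.")
```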
To sum up, the script processing method provided in the embodiment of the present invention may determine, according to a preset episode-number expression range, the episode numbers and their positions contained in the script to be processed, and divide the script into a plurality of episodes accordingly; determine, according to a preset scene-number expression range, the scene numbers and their positions contained in each episode, and divide the episode into a plurality of scene texts accordingly; extract, for at least one scene text, the scene information characters it contains; determine the scene information characters, the scene number of the scene text, and the episode number of the episode to which it belongs as the information to be sorted for that scene text; and combine the information to be sorted with the body text of the scene text in a preset form to form the target script. In the embodiment of the invention, the script to be processed is divided into scene texts and extraction takes a single scene text as its processing object, so the degree of coupling within the script can be reduced to a certain extent, the interference of the script format with scene-information extraction is lessened, and extraction accuracy is improved. Meanwhile, after the scene information is extracted, the scene texts are recombined in the preset form, so the internal form of each scene text in the script is kept consistent, which facilitates subsequent processing.
Fig. 2-1 is a flowchart of steps of another scenario processing method provided by an embodiment of the present invention, which may be applied to an electronic device, as shown in fig. 2-1, and the method may include:
step 201, performing a preprocessing operation on the scenario to be processed.
In this step, the preprocessing operation may be an operation for normalizing the scenario to be processed. In particular, the preprocessing operations may include one or more of the following operations:
(1) and deleting the interference information in the scenario to be processed.
The interference information may be set according to whatever actually interferes with script processing. By way of example, it may include at least: page-number information, line-number information, and the blanks and tabs at the start and end of each line. Page-number information is the page number of each page in the script to be processed, and line-number information is the number of each line; both may be introduced when the text to be processed is a PDF file. To delete them, the formats of the page and line numbers introduced by PDF files can be summarized in advance, and the script to be processed matched against those formats. Because page and line numbers are both numeric, deleting them avoids interference with the identification of episode and scene numbers. Because blanks and tabs at the start and end of a line affect the recognition of paragraphs, deleting these useless characters reduces the processing burden on the script to a certain extent and improves the processing result. Specifically, the script to be processed may be traversed line by line to detect whether each line contains interference information, which is deleted directly if found. Of course, the interference information may also include other items, for example information introduced by Word-format files.
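A sketch of operation (1), under the assumption that page- and line-number lines are purely numeric; a production version would summarize the actual number formats from sample PDF extractions, as the text describes:

```python
import re

# Hypothetical page/line-number pattern (e.g. "12" or "- 12 -"); the
# real formats would be collected in advance from sample scripts.
PAGE_OR_LINE_NO = re.compile(r"^\s*-?\s*\d+\s*-?\s*$")

def delete_interference(script_text):
    """Traverse the script line by line: drop lines that are only a
    page/line number, and strip blanks and tabs at each line's ends."""
    kept = []
    for line in script_text.splitlines():
        if PAGE_OR_LINE_NO.match(line):
            continue  # numeric-only line: treated as page/line-number interference
        kept.append(line.strip(" \t"))
    return "\n".join(kept)
```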
(2) And converting the fonts in the script to be processed into preset fonts.
The preset font may be set in advance; for example, it may be simplified Chinese. Converting the traditional Chinese characters in the script to be processed into simplified characters avoids recognition errors caused by inconsistent character forms. Specifically, the simplified character corresponding to each traditional character in the script can be looked up, and the traditional character replaced with it, realizing the conversion. Of course, the conversion may also be implemented in other ways, which this embodiment does not limit.
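Operation (2) amounts to a character-level lookup-and-replace. The tiny table below is illustrative; a full conversion would use a complete traditional-to-simplified mapping:

```python
# A tiny illustrative traditional-to-simplified table; a production
# system would use a full character mapping looked up as described.
T2S = {"劇": "剧", "場": "场", "東": "东", "國": "国"}

def to_simplified(text):
    # Replace each traditional character with its simplified form,
    # leaving unmapped characters unchanged.
    return "".join(T2S.get(ch, ch) for ch in text)
```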
(3) And converting punctuation marks in the script to be processed into punctuation marks corresponding to the punctuation marks in the symbol mapping relation according to a preset symbol mapping relation.
In this operation, when the preset symbol mapping relation is established, punctuation marks with the same actual function can be classified into the same category according to their function in text, yielding multiple categories. For each category, the most frequently used punctuation mark is selected as the category's representative mark, and a mapping is established from every mark in the category to that representative mark, producing the symbol mapping relation. That is, in the symbol mapping relation, every punctuation mark in a category maps to the category's representative mark. When this operation is executed, the mark corresponding to each punctuation mark in the script to be processed can be looked up in the mapping relation, and the original mark replaced with it, so that all punctuation is mapped. Converting punctuation in this way lets marks that play the same role share the same written form, making the script to be processed more standardized and reducing interference with subsequent processing to a certain extent.
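A minimal sketch of the symbol mapping relation; the concrete groupings and representative marks chosen here are illustrative assumptions:

```python
# Marks with the same function map to one representative mark per
# category (here, the full-width form stands in as the representative).
SYMBOL_MAP = {
    ",": "，",   # half-width comma → full-width comma
    ":": "：",   # half-width colon → full-width colon
    "(": "（", ")": "）",
}

def map_punctuation(text):
    # Replace each mark with its representative; others pass through.
    return "".join(SYMBOL_MAP.get(ch, ch) for ch in text)
```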
(4) And deleting punctuation marks which do not belong to the available punctuation mark range value in the script to be processed according to a preset available punctuation mark range value.
In this operation, the available punctuation range may be set according to actual conditions; for example, punctuation marks that do not interfere with subsequent processing may be taken as the members of the range. Specifically, each punctuation mark in the script to be processed may be compared one by one against the marks in the available range: if the same mark exists in the range, the mark is retained; otherwise it is deleted. Selectively deleting certain punctuation in this way makes the script to be processed more standardized and reduces interference with subsequent processing to a certain extent. Of course, the preprocessing may also include other operations, for example converting full-width symbols in the script to half-width symbols, or deleting all characters outside a specified range, where the excluded part is predefined according to actual needs as not requiring processing; this embodiment does not limit these, and such operations further improve standardization.
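Operation (4) can be sketched as a whitelist filter; the membership of the allowed range here is an illustrative assumption:

```python
import unicodedata

# Illustrative available-punctuation range; in practice membership is
# chosen so retained marks do not interfere with later processing.
ALLOWED_PUNCT = set("，。：；？！（）")

def filter_punctuation(text):
    """Keep non-punctuation characters; keep a punctuation mark only if
    it lies in the available range, deleting the rest."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") or ch in ALLOWED_PUNCT
    )
```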
For example, fig. 2-2 is a schematic diagram of preprocessing provided by an embodiment of the present invention. As shown in fig. 2-2, the script to be processed may undergo the preprocessing operations in stage two, finally yielding the preprocessed script. The range values and regular expressions read in stage one may be used in preprocessing and in subsequent steps; reading them in advance can improve processing efficiency to some extent. It should be noted that only part of the read-in content is shown in the figure; in practical applications it may also include other content.
Step 202, according to a preset set number expression range, determining a set number and a position of the set number contained in the scenario to be processed, and dividing the scenario to be processed into a plurality of scenarios according to the contained set number and the position of the set number.
Specifically, this step can be realized by the following substeps (1) to (3):
substep (1): generating a set number regular expression according to the preset set number expression range; the set number regular expression is defined with the set number contained in the set number expression range.
In this step, the set numbers of the different expression forms included in the set number expression range may be used as parameters of the regular expression to generate a group of characters describing characteristics of the character string, and the group of characters may represent a filtering logic, so as to obtain the set number regular expression. The set numbers of different expression forms are used as parameters of the regular expression, so that the set numbers of various different expression forms can be defined in the generated set number regular expression.
Substep (2): and performing regular matching on the script to be processed by using the set number regular expression, and determining the set number and the position of the set number of each script contained in the script to be processed.
Regular matching refers to matching and filtering the content of the script to be processed with a regular expression. For example, each character string in the script may be matched against the patterns defined in the regular expression and the matches extracted, thereby obtaining the episode number and position of each episode contained in the script to be processed.
Substep (3): and for any episode number, taking a text between the position of the episode number and the position of the next episode number as an episode represented by the episode number and dividing the episode to obtain a plurality of episodes included in the to-be-processed episode.
Since each episode number typically marks the start of an episode, in this step the text between the position of one episode number and the position of the next may be determined and taken as one episode, thereby dividing the script to be processed into its multiple episodes. The next episode number is the episode number that immediately follows the current one in the writing order of the script. In the embodiment of the invention, the episode-number regular expression is generated from the preset episode-number expression range and is used to determine the episode numbers and segment the script; because it completes matching and filtering quickly and defines episode numbers in many different expression forms, it can improve the efficiency and accuracy of episode-number determination to a certain extent, and hence the efficiency of episode division. When episode supplementary information exists, it can be extracted and divided into episodes to avoid omission during processing; episode supplementary information refers to episode content appended to the script to be processed afterwards.
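Substeps (1)–(3) can be sketched as follows; the concrete alternatives inside the regular expression are assumptions standing in for the pre-collected episode-number expression range:

```python
import re

# Substep (1): episode numbers in several expression forms
# ("第1集", "第一集", ...) compiled into one regular expression.
EPISODE_RE = re.compile(r"第\s*[0-9一二三四五六七八九十百]+\s*集")

def split_episodes(script_text):
    # Substep (2): regular matching yields each episode number and position.
    matches = list(EPISODE_RE.finditer(script_text))
    # Substep (3): the text from one episode number up to the next is
    # the episode that number represents.
    episodes = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(script_text)
        episodes.append(script_text[m.start():end])
    return episodes
```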
And 203, for at least one episode, determining a scene number and a position of the scene number contained in the episode according to a preset scene number expression range, and dividing the episode into a plurality of scene texts according to the scene number and the position of the scene number.
Specifically, this step can be realized by the following operations: a scene number regular expression is generated according to a preset scene number expression range, where the scene number regular expression defines the scene numbers contained in that expression range; the episode is then matched against the scene number regular expression to determine the scene number of each scene contained in the episode and the position of each scene number. For any scene number, the text between the position of that scene number and the position of the next scene number is taken as the scene text represented by that scene number, yielding the plurality of scene texts included in the episode. The details of each operation may refer to the related description in step 202, and the embodiment of the present invention does not limit this.
Furthermore, in the embodiment of the invention, because the scene number regular expression is generated from the preset scene number expression range and defines scene numbers in a plurality of different expression forms, matching and filtering can be completed rapidly, so the efficiency and accuracy of scene number determination, and hence the efficiency of scene text division, can be improved to a certain extent. Meanwhile, because the episode number regular expression adopted in the embodiment of the invention is generated from episode numbers in a plurality of different expression forms, and the scene number regular expression is generated from scene numbers in a plurality of different expression forms, matching with these expressions amounts to fuzzy matching: a text is determined to match when either condition is met. Compared with exact matching, in which a text is determined to match only when both conditions are met, this expands the range of script formats that can be processed to a certain extent. When scene supplementary information exists, the supplementary information can also be extracted and divided into scenes, so that it is not omitted during processing. Scene supplementary information refers to scene text added to the scenario to be processed at a later stage.
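The scene-splitting operation described above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: the pattern assumes one hypothetical heading format, "episode-scene" numbers such as "1-44" at the start of a line, and the function name is invented for the example.

```python
import re

# Assumed scene-number expression: a line starting with "<episode>-<scene>",
# e.g. "1-44". A real implementation would build this pattern from the
# preset scene number expression range (many alternative forms).
SCENE_RE = re.compile(r"(?m)^(\d+)-(\d+)")

def split_episode_into_scenes(episode_text):
    """Return (scene_number, scene_text) pairs: each scene text runs from
    one scene-number match up to the position of the next match."""
    matches = list(SCENE_RE.finditer(episode_text))
    scenes = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(episode_text)
        scenes.append((int(m.group(2)), episode_text[m.start():end]))
    return scenes
```

Because the span between consecutive matches is taken verbatim, no text is lost even when a scene heading carries supplementary information after the number.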
Further, since body text may also begin with a number, a determined scene number may be a number that actually indicates a scene, or it may merely be a number at the beginning of body text. Correspondingly, in the embodiment of the invention, after the scene texts are divided, a secondary judgment can be carried out to ensure the correctness of the divided scene texts. The specific process of the secondary judgment may be as follows: determine the text length of the scene text, and determine whether the scene number of the scene text is consecutive with the scene numbers of the adjacent scene texts. Specifically, when determining the text length, the number of characters included in the scene text may be counted and taken as the text length. When determining whether the numbers are consecutive, the absolute value of the difference between the scene number of the scene text and the scene number of the previous adjacent scene text may be computed, as well as the absolute value of the difference between the scene number of the scene text and the scene number of the next adjacent scene text. If either absolute value equals 1, the numbers can be confirmed to be consecutive; otherwise, the numbers may be considered discontinuous.
If the text length is smaller than a preset length threshold and the scene number of the scene text is not consecutive with the scene numbers of the adjacent scene texts, the scene text is removed. The preset length threshold may be determined according to the lowest word count that a scene text representing a real scene can contain. If the text length is smaller than the preset length threshold, the scene text may not be a real scene text; if, in addition, its scene number is not consecutive with the scene numbers of the adjacent scene texts, it can be further concluded that the scene text is probably not a real scene text. Therefore, when the text length is smaller than the preset length threshold and the numbering is discontinuous, the scene text is confirmed not to be a real scene text and is removed to ensure accuracy. Removing the scene text refers to removing it from the set of divided scene texts; specifically, the removed scene text may be merged back into the body text. In the embodiment of the invention, the secondary judgment performed after the scene texts are divided improves the accuracy of the scene texts and thereby ensures the effect of subsequent processing on the scene texts.
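The secondary judgment can be sketched as follows; the threshold value and function name are assumptions for illustration. A scene is kept when it is long enough or its number is consecutive with a neighbour, and dropped only when both checks fail, as described above.

```python
def secondary_check(scenes, min_length=20):
    """scenes: list of (scene_number, scene_text) pairs, in document order.
    Drop a scene only if it is shorter than min_length AND its number is
    not consecutive (difference of 1) with either adjacent scene."""
    kept = []
    for i, (number, text) in enumerate(scenes):
        long_enough = len(text) >= min_length
        consecutive = (
            (i > 0 and abs(number - scenes[i - 1][0]) == 1)
            or (i + 1 < len(scenes) and abs(scenes[i + 1][0] - number) == 1)
        )
        if long_enough or consecutive:
            kept.append((number, text))
    return kept
```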
In order to improve the accuracy of the secondary determination, conditions that the scene text needs to satisfy may also be increased, for example, whether the scene text contains the scene information characters may be further determined, and the scene text is removed when the scene text does not contain the scene information characters and the text length is smaller than the preset length threshold and the number is discontinuous. Therefore, the scene text can be prevented from being removed by mistake to a certain extent, and the accuracy of secondary judgment is improved.
Correspondingly, in the embodiment of the invention, the episodes can also be secondarily judged so as to improve the accuracy of the episodes obtained by division. The specific process of the secondary judgment may be as follows: determine the text length of the episode, and determine whether the episode number of the episode is consecutive with the episode numbers of the adjacent episodes. If the text length is smaller than a preset length threshold and the episode number of the episode is not consecutive with the episode numbers of the adjacent episodes, the episode is removed. The specific implementation of each step may refer to the foregoing related description, and the embodiment of the present invention does not limit this. In the embodiment of the invention, the secondary judgment performed on the episodes after division improves the accuracy of the episodes and thereby ensures the effect of subsequent processing on the episodes.
Step 204, extracting scene information characters contained in the scene text for at least one scene text.
Specifically, this step can be realized by the following substeps (4) to (6):
substep (4): and traversing the scene text according to a preset scene information cue word range value to determine whether the scene text contains the scene information cue words.
In this step, a scene information cue word refers to a scene information character used to indicate that the text content that follows presents scene information. For example, some scripts are written with cue words before the time, place, character and other information. For example, assume that the content of scene A is "1-23 Local weather: rain; Place of occurrence: school playground; Persons on the scene: the teacher", where 'Local weather', 'Place of occurrence' and 'Persons on the scene' are scene information cue words. In this step, the preset scene information cue word range value may include a plurality of cue words, which may be collected in advance. For example, the scene information cue words may be extracted from sample scenarios, or commonly used scene information cue words may be collected from the network and used as the preset scene information cue words.
Further, when making the determination, each scene information cue word included in the cue word range value may be compared with the words in the scene text. If a matching word exists in the scene text, it can be determined that the scene text contains a scene information cue word; if no matching word exists, it can be determined that the scene text does not contain a scene information cue word. It should be noted that when a scenario is written, the description information, that is, the text that embodies the scene information, is often placed at the front of a scene text. Therefore, in the embodiment of the present invention, only the front portion of the scene text may be traversed to determine whether it contains a scene information cue word. In this way, the amount of text that needs to be traversed can be reduced, saving processing resources.
Substep (5): and if the scene text contains the scene information prompt words, determining characters adjacent to the scene information prompt words as scene information characters, and extracting the characters.
Because the characters adjacent to and following a scene information cue word often represent scene information, in this step, when it is determined that the scene text contains a scene information cue word, the characters adjacent to the cue word can be extracted directly, thereby obtaining the scene information characters.
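Sub-steps (4) and (5) can be sketched together as follows. The cue words and the function name are assumed examples; the sketch takes the characters between a cue word and the next line break as the adjacent scene information characters.

```python
def extract_by_cue_words(scene_text, cue_words):
    """For each cue word found in the scene text, take the characters
    adjacent to it (up to the end of the line) as scene information."""
    info = {}
    for cue in cue_words:
        idx = scene_text.find(cue)
        if idx != -1:
            rest = scene_text[idx + len(cue):]
            # Characters adjacent to the cue word, up to the next line break.
            info[cue.rstrip(":")] = rest.split("\n", 1)[0].strip()
    return info
```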
Substep (6): if the scene text does not contain scene information cue words, dividing the scene text into a plurality of sub-texts; and extracting scene information characters from the sub-texts according to preset scene information character range values and/or the part of speech of the words in the sub-texts.
In this step, when dividing into sub-texts, the scene text may be divided evenly into a plurality of sub-texts according to a fixed number of words, where each sub-text is a candidate character string to be processed. Alternatively, a division may be performed wherever a specific symbol appears, thereby obtaining a plurality of sub-texts. The specific symbol may be a line-feed character, a tab character, a space, a comma, and the like, which is not limited by the embodiment of the present invention.
Further, the scene information character range value may be a set containing characters commonly used to represent scene information. The preset scene information character range value may be extracted in advance from sample scenarios or collected in advance from the network. The scene information character range value may include common characters representing time, place, weather and person names. The characters representing time, place, weather and person names may each belong to a corresponding sub-range value, and the characters included in these sub-range values together constitute the scene information character range value. The sub-range values for time and place may store words that approximately represent a time and words that approximately represent a place, and accordingly, the scene information characters extracted based on these sub-range values may be referred to as approximate time words and approximate place words. Of course, the characters may also be mixed together to form a single scene information character range value. The various range values, mapping relationships, sets and regular expressions mentioned in the embodiments of the present invention may be read from the device in advance before use. Further, the characters representing person names may be extracted from the dialogue portion of the scenario to be processed according to a preset character dialogue format. The preset character dialogue format may be set in advance according to the character dialogue in scripts. For example, the dialogue text in a scenario will often contain the name of the speaking character; therefore, by extracting names from the dialogue portion of the scenario to be processed according to the character dialogue format, the accuracy of the extracted names can be ensured to some extent.
Specifically, when extracting names, the content of the scenario to be processed may be compared against the preset character dialogue format, and the content whose format matches the preset character dialogue format is determined to be the dialogue portion of the scenario to be processed. Then, the words at specific positions in the dialogue portion are extracted as person names. The specific position may be before a punctuation mark combination, which may be a colon followed by a double quotation mark.
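A minimal sketch of this name extraction, assuming one hypothetical dialogue format: the name appears at the start of a line, immediately before a colon-plus-quotation-mark combination. The length bound and function name are assumptions for the example.

```python
import re

# Assumed dialogue format: <name>: "<speech>", with the name at most
# four characters long (typical for Chinese names).
DIALOGUE_RE = re.compile(r'(?m)^\s*([^\s:]{1,4}):\s*[“"]')

def extract_person_names(script_text):
    """Return the sorted set of words found before the colon-and-quote
    combination, taken as person names."""
    return sorted(set(DIALOGUE_RE.findall(script_text)))
```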
Further, when extracting scene information characters from the sub-texts according to the part of speech, the part of speech of each word included in the sub-text may be determined first. Then, the words whose part of speech is a preset part of speech and which contain a specific character are determined to be scene information characters and extracted. Specifically, when determining the part of speech, the sub-text may be divided into a plurality of words, and the part of speech corresponding to each word may then be looked up from the network. The preset part of speech and the specific characters may be set according to actual conditions. For example, the preset part of speech may be noun, and the specific characters may be single characters representing a feature of a place, such as hall, building, road and house, or single characters representing a feature of a person. Because extraction by part of speech and specific character only requires collecting a small set of single characters in advance as its basis, the implementation cost is low.
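The specific-character test can be sketched as below. This is an assumption-laden illustration: the character set is a small invented sample of place-feature characters, and the part-of-speech check (which the patent describes as a separate lookup) is left out of the sketch.

```python
# Assumed sample of single characters that mark a place feature
# (hall, building, road, house, bureau, gate, shop).
PLACE_FEATURE_CHARS = set("厅楼路屋局门店")

def looks_like_place(word):
    """A multi-character word ending in a place-feature character is
    treated as a location scene-information character. A real
    implementation would additionally require the word to be a noun."""
    return len(word) >= 2 and word[-1] in PLACE_FEATURE_CHARS
```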
Further, when extracting scene information characters from a sub-text according to the preset scene information character range value, the sub-text may be traversed according to the preset scene information character range value to determine whether the sub-text contains any character that exists in the range value; if so, the character is determined to be a scene information character and extracted. By traversing and extracting in this way, a scene information character can be extracted directly when the entire sub-text represents a single scene information character, and when the sub-text consists of a plurality of scene information characters, it can be accurately split into those characters. For example, when a sub-text consists of a scene information character indicating a time and a scene information character indicating a place, 2 scene information characters may be extracted. Or, when a sub-text consists of two scene information characters representing a place, one representing a specific place and the other an approximate place, and the place contains more than 1 word, both scene information characters can be obtained through extraction.
Specifically, each character in the scene information character range value may be compared with each character in the sub-text; if a matching character exists, the matching character may be considered a scene information character, and the extraction operation may be performed. Further, after a character is determined to be a scene information character and extracted, when the extracted scene information character represents a person name, the characters adjacent to it may also be determined to be characters representing person names and extracted. In scripts, when person names appear on the same line as the time and there is no cue word, the names are usually placed last. Therefore, if a scene information character is a character indicating a person name, the characters following it can also be taken to be person names, and extraction can continue. In this way, new scene information characters can be extracted quickly without further operations, which can improve extraction efficiency to a certain extent.
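Sub-step (6) can be sketched as follows. The range values are invented samples, the separators and function name are assumptions, and the sketch matches whole sub-texts (split at specific symbols) rather than single characters, which is one simple reading of the traversal described above.

```python
import re

# Assumed sample range values; a real system would collect these from
# sample scenarios or the network, per category.
TIME_WORDS = {"day", "night", "dawn", "dusk"}
PLACE_WORDS = {"gate", "office", "playground"}
WEATHER_WORDS = {"rain", "snow", "sunny"}
RANGES = {"time": TIME_WORDS, "location": PLACE_WORDS, "weather": WEATHER_WORDS}

def extract_from_subtexts(scene_header, separators=",，、 \t"):
    """Split the header into sub-texts at specific symbols, then match each
    sub-text against the preset scene information character range values."""
    subtexts = [s for s in re.split("[" + re.escape(separators) + "]+", scene_header) if s]
    info = {}
    for sub in subtexts:
        for category, words in RANGES.items():
            if sub.lower() in words:
                info[category] = sub
    return info
```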
In the embodiment of the invention, whether the scene text contains scene information cue words is determined according to the preset scene information cue word range value. When the scene text contains cue words, the scene information characters can be determined by direct extraction, which avoids unnecessary operations and improves processing efficiency. Meanwhile, when the scene text does not contain scene information cue words, the scene information characters in the scene text are determined according to the scene information character range value and the parts of speech of the words in the character strings. Of course, in another optional embodiment of the present invention, extraction may also be performed only according to the scene information cue words, or only according to the scene information character range value and/or the parts of speech of the words in the sub-texts; alternatively, one part of the scene texts may be extracted according to the scene information cue words while another part is extracted according to the scene information character range value and/or the parts of speech of the words in the sub-texts.
For example, fig. 2 to 3 are schematic processing flow diagrams provided by an embodiment of the present invention, and as shown in fig. 2 to 3, diversity and scene-by-scene processing may be performed on the preprocessed script according to the set-number regular expression and the scene-number regular expression to obtain a plurality of scene texts. And then, determining whether the scene text contains the scene information cue words, extracting according to the scene information cue words under the condition of inclusion, and extracting according to the preset scene information character range value, the part of speech and the specific characters under the condition of no inclusion. And finally, obtaining scene information.
Step 205, determining the scene information characters contained in the scene text, the scene number of the scene text and the episode number of the episode to which the scene text belongs as the information to be sorted of the scene text.
Specifically, the step 104 may be referred to in the implementation manner of this step, which is not limited in this embodiment of the present invention.
And step 206, combining the information to be sorted of the scene text and the text in the scene text according to a preset form to form a target script.
Specifically, this step can be realized by the following substeps (7) to (8):
substep (7): and setting an information category identifier corresponding to the information category to which the information to be sorted belongs for the information to be sorted according to the information category to which the information to be sorted belongs, and setting a text identifier for the text.
In this step, the information categories may include episode number, scene number, time, location, person, weather, and the like. The information category identifier corresponding to each information category and the body text identifier may be set in advance according to actual requirements, which is not limited by the embodiment of the present invention. For example, the information category identifier corresponding to the episode number may be "epsilon_id", the identifier corresponding to the scene number may be "setting_id", the identifier corresponding to the time may be "time", the identifiers corresponding to the location may be "side", "location", and the like, and the identifier corresponding to the weather may be "weather". The body text identifier may be denoted "content".
Specifically, when the identifier is set, the information category identifier may be used as a key, the scene information character may be used as a value, and a key value pair is formed to obtain the scene information character with the information category identifier set, and the text identifier may be used as a key, and the text may be used as a value, and a key value pair is formed to obtain the text with the text identifier set. It should be noted that, in the embodiment of the present invention, key value pairs corresponding to the scene information characters may also be stored in a preset storage area, so as to implement storing information to be sorted of different information types in a form of key value pairs.
Substep (8): and combining the information to be sorted with the information category identification and the text with the text identification.
In this step, the combination may be performed in a preset order, which may be set in advance according to actual requirements. For example, the preset order may be: episode number scene information, scene number scene information, time scene information, location scene information, weather scene information, body text.
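Sub-steps (7) and (8) can be sketched together: each identifier becomes a key and the corresponding scene information character becomes a value, and the key-value pairs are emitted in the preset order. The key names follow the identifiers listed in this step; the function name is an assumption.

```python
PRESET_ORDER = ["epsilon_id", "setting_id", "time", "side", "location", "weather", "content"]

def combine_scene(info, body_text):
    """Combine the to-be-sorted information and the body text into one
    record whose keys follow the preset order; a category with no
    extracted value becomes an empty string."""
    record = {}
    for key in PRESET_ORDER:  # dicts preserve insertion order (Python 3.7+)
        record[key] = body_text if key == "content" else info.get(key, "")
    return record
```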
Fig. 2-4 are schematic processing diagrams provided by an embodiment of the present invention, and as shown in fig. 2-4, the contents of an original scenario to be processed are:
1-44 Political security administration gate, day, external
The rest of the qi will be tense and go into the mouth deep. "
For the scenario to be processed, the scenario may first be preprocessed and the scene information then extracted, where the preprocessing and scene-extraction operations may be implemented based on a service dictionary and regular expressions. The service dictionary may be the various range values referred to in the foregoing step 201, and the regular expressions may be those referred to in the foregoing steps. Then, text normalization may be performed, that is, this step is executed, finally obtaining a script with a normalized format. By way of example, the content of the resulting script may be: {'epsilon_id': 1, 'setting_id': 44, 'time': 'day', 'side': 'external', 'location': 'political security administration gate', 'weather': '', 'content': 'The rest is tense and deep enough to get into the mouth.'}
Further, in order to improve the quality of the body text, in the embodiment of the present invention, the body text may be further processed as follows: detect whether the body text contains a line whose word count is less than a preset per-line word capacity and whose end does not contain an ending punctuation mark. The preset per-line word capacity may be the maximum number of words that a line in the scenario to be processed can contain, and an ending punctuation mark is a punctuation mark that indicates the end of a sentence, for example, a period. Further, if such a line exists, the next adjacent line is merged with that line if the number of words the line contains is less than a first preset word count threshold and/or the head word of the next adjacent line does not represent a person name. Specifically, if the word count of a line does not reach the preset per-line word capacity and the end of the line contains no ending punctuation mark, an erroneous line break may have occurred there. To ensure the accuracy of the detection, it may be further determined whether the number of words the line contains is less than the first preset word count threshold and whether the head word of the next adjacent line represents a person name. An erroneous line break at the end of the line can be considered confirmed if the number of words the line contains is less than the first preset word count threshold and/or the head word of the next adjacent line does not represent a person name. In that case, the next adjacent line can be merged with the line, thereby repairing the broken line. Merging the next adjacent line with the line may consist of connecting the head of the next adjacent line to the tail of the line.
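The line-break repair can be sketched as follows. The thresholds, the punctuation set and the name check are assumptions; the sketch concatenates lines directly (as in Chinese text, which needs no joining space).

```python
END_PUNCT = '。！？.!?"”'  # assumed ending punctuation marks

def repair_broken_lines(lines, max_line_words=30, first_word_names=()):
    """Merge a line with the next one while it looks like an accidental
    break: shorter than the per-line capacity, no ending punctuation, and
    the next line does not open a known character's dialogue."""
    repaired = []
    i = 0
    while i < len(lines):
        line = lines[i]
        while (i + 1 < len(lines)
               and len(line) < max_line_words
               and (not line or line[-1] not in END_PUNCT)
               and lines[i + 1].split(":")[0] not in first_word_names):
            line += lines[i + 1]  # direct concatenation, head to tail
            i += 1
        repaired.append(line)
        i += 1
    return repaired
```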
Further, paragraphs whose word count is greater than a second preset word count threshold can be detected in the body text; for each line in such a paragraph, the ending punctuation marks contained in the line are determined, and the text following each ending punctuation mark is divided into the next line. The second preset word count threshold may be set according to the maximum number of words a paragraph in the script can contain; if a paragraph's word count exceeds it, the paragraph may contain places where a line break should have occurred but did not. Therefore, the ending punctuation marks contained in each line of the paragraph can be determined, and the characters after each ending punctuation mark divided into the next line, thereby breaking up long passages. Dividing the text after an ending punctuation mark into the next line can be realized by inserting a line-feed character after the punctuation mark. In the embodiment of the invention, performing line-break repair and long-sentence breaking makes the body text more standard, which facilitates later processing of the script.
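The long-sentence breaking can be sketched as follows; the threshold, punctuation set and function name are assumptions for illustration.

```python
import re

def break_long_paragraph(paragraph, max_words=200):
    """If a paragraph exceeds the threshold, insert a line-feed character
    after every ending punctuation mark so each sentence starts a new line."""
    if len(paragraph) <= max_words:
        return paragraph
    return re.sub(r"([。！？.!?])", r"\1\n", paragraph).rstrip("\n")
```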
For example, fig. 2 to 5 are schematic diagrams illustrating a scene text according to an embodiment of the present invention, and as shown in fig. 2 to 5, the scene text may include scene information characters indicating time, place, person, and weather, and a body text, where the body text is subjected to body format normalization processing, that is, line breaking and line breaking, to obtain a normalized body text.
In summary, the scenario processing method provided in the embodiment of the present invention first performs a preprocessing operation on the scenario to be processed, so as to reduce interference factors in the scenario and thereby improve subsequent processing effects to a certain extent. Then, the episode numbers and their positions contained in the scenario to be processed are determined according to a preset episode number expression range, and the scenario is divided into a plurality of episodes according to those episode numbers and positions. The scene numbers and their positions contained in each episode are determined according to a preset scene number expression range, and the episode is divided into a plurality of scene texts according to those scene numbers and positions. For at least one scene text, the scene information characters contained in the scene text are extracted; the scene information characters, the scene number of the scene text, and the episode number of the episode to which the scene text belongs are determined as the information to be sorted of the scene text; and the information to be sorted and the body text in the scene text are combined according to a preset form to form a target script. In the embodiment of the invention, the scenario to be processed is divided into scene texts, and each single scene text is taken as the processing object for extraction, so the coupling degree within the script can be reduced to a certain extent, the interference of the script format with scene information extraction can be reduced, and the extraction accuracy improved.
Meanwhile, after the scene information is extracted, the scene texts are recombined according to a preset form, so that the internal forms of the scene texts in the script are kept consistent, and the script is convenient to process subsequently.
Fig. 3 is a block diagram of a scenario processing apparatus provided in an embodiment of the present invention, where the apparatus may be applied to an electronic device, and as shown in fig. 3, the apparatus 30 may include:
a first determining module 301, configured to determine, according to a preset collection number expression range, a collection number and a position of the collection number included in the scenario to be processed, and divide the scenario to be processed into multiple scenarios according to the position of the collection number included in the scenario to be processed.
A second determining module 302, configured to determine, for at least one episode, a scene number and a position of the scene number included in the episode according to a preset scene number expression range, and divide the episode into the plurality of scene texts according to the scene number and the position of the scene number.
An extracting module 303, configured to extract, for at least one of the scene texts, a scene information character included in the scene text.
A third determining module 304, configured to determine, as information to be sorted of the scene text, a scene information character included in the scene text, a scene number of the scene text, and a collection number of an episode to which the scene text belongs.
And the combining module 305 is configured to combine the information to be sorted of the scene text and the text in the scene text according to a preset form to form a target scenario.
Optionally, the extracting module 303 is specifically configured to:
and traversing the scene text according to a preset scene information cue word range value to determine whether the scene text contains the scene information cue words.
And if the scene text contains the scene information prompt words, determining characters adjacent to the scene information prompt words as scene information characters, and extracting the characters.
If the scene text does not contain scene information cue words, dividing the scene text into a plurality of sub-texts; and extracting scene information characters from the sub-texts according to preset scene information character range values and/or the part of speech of the words in the sub-texts.
Optionally, the extracting module 303 is further specifically configured to:
determining the part of speech of the words contained in the subfile; and determining the words with the parts of speech as preset parts of speech and containing the specific characters as scene information characters, and extracting.
And/or traversing the sub-text according to a preset scene information character range value to determine whether the sub-text contains characters existing in the preset scene information character range value; and if so, determining the character as a scene information character and extracting the scene information character.
Wherein, the character range value of the scene information at least comprises one of the following information: commonly used characters representing time, characters representing place, characters representing weather, and characters representing name of person.
Optionally, the first determining module 301 is specifically configured to:
generating a set number regular expression according to the preset set number expression range; the set number regular expression is defined with the set number contained in the set number expression range.
And performing regular matching on the script to be processed by using the set number regular expression, and determining the set number and the position of the set number of each script contained in the script to be processed.
Optionally, the second determining module 302 is specifically configured to:
and generating a scene number regular expression according to the preset scene number expression range. The scene number regular expression is defined with the scene number contained in the scene number expression range.
And performing regular matching on the episode by using the scene number regular expression, and determining the scene number of each scene contained in the episode and the position of the scene number.
Optionally, the apparatus 30 further includes:
and the preprocessing module is used for preprocessing the script to be processed.
Wherein the preprocessing operation comprises at least one of the following operations: and deleting the interference information in the scenario to be processed.
And converting the fonts in the script to be processed into preset fonts.
And converting punctuation marks in the script to be processed into punctuation marks corresponding to the punctuation marks in the symbol mapping relation according to a preset symbol mapping relation.
And deleting punctuation marks which do not belong to the available punctuation mark range value in the script to be processed according to a preset available punctuation mark range value.
Optionally, the combining module 305 is specifically configured to:
and setting an information category identifier corresponding to the information category to which the information to be sorted belongs for the information to be sorted according to the information category to which the information to be sorted belongs, and setting a text identifier for the text.
And combining the information to be sorted with the information category identification and the text with the text identification.
To sum up, in the scenario processing apparatus provided in the embodiment of the present invention, the first determining module may determine, according to a preset episode number expression range, the episode numbers and their positions contained in the scenario to be processed, and divide the scenario into a plurality of episodes according to those episode numbers and positions. The second determining module may determine, according to a preset scene number expression range, the scene numbers and their positions contained in an episode, and divide the episode into a plurality of scene texts according to those scene numbers and positions. The extracting module may extract, for at least one scene text, the scene information characters contained in the scene text. The third determining module may determine the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the information to be sorted of the scene text. The combining module may combine the information to be sorted of the scene text and the body text in the scene text according to a preset form to form the target script. In the embodiment of the invention, the scenario to be processed is divided into scene texts, and each single scene text is taken as the processing object for extraction, so the coupling degree within the script can be reduced to a certain extent, the interference of the script format with scene information extraction can be reduced, and the extraction accuracy improved. Meanwhile, after the scene information is extracted, the scene texts are recombined according to a preset form, so that the internal forms of the scene texts in the script remain consistent, which facilitates subsequent processing of the script.
For the above device embodiment, since it is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiment description.
An embodiment of the present invention further provides an electronic device, as shown in FIG. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 communicate with each other through the communication bus 404,
a memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, implements the following steps:
determining an episode number and a position of the episode number contained in the script to be processed according to a preset episode number expression range, and dividing the script to be processed into a plurality of episodes according to the episode number and the position of the episode number;
for at least one episode, determining a scene number and a position of the scene number contained in the episode according to a preset scene number expression range, and dividing the episode into a plurality of scene texts according to the scene number and the position of the scene number;
for at least one scene text, extracting scene information characters contained in the scene text;
determining the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as information to be sorted of the scene text;
and combining the information to be sorted of the scene text and the body text in the scene text according to a preset form to form a target script.
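The steps above can be sketched in Python. The episode and scene marker patterns used here ("第N集" for an episode heading, a leading "N、" for a scene heading) are illustrative assumptions standing in for the preset expression ranges; the patent does not fix their actual values.

```python
import re

# Assumed marker patterns (placeholders for the preset expression ranges).
EPISODE_RE = re.compile(r"第([0-9一二三四五六七八九十]+)集")
SCENE_RE = re.compile(r"^\s*([0-9]+)[、.]", re.MULTILINE)

def split_by(pattern, text):
    """Split text at each pattern match; return (number, chunk) pairs."""
    matches = list(pattern.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append((m.group(1), text[m.start():end]))
    return chunks

def process(script):
    """Divide the script into episodes, then scene texts, and collect the
    numbers to be sorted together with each scene's body."""
    target = []
    for ep_no, episode in split_by(EPISODE_RE, script):
        for sc_no, scene in split_by(SCENE_RE, episode):
            target.append({"episode": ep_no, "scene": sc_no, "body": scene})
    return target
```

Splitting by episode first and only then by scene keeps each scene text self-contained, which is what lets the later extraction step treat a single scene as its processing object.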
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include a random access memory (RAM), or may include a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute the scenario processing method described in any one of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the scenario processing method of any one of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are produced, in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a correlated manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described briefly since it is substantially similar to the method embodiment; for relevant details, refer to the corresponding parts of the method embodiment description.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A scenario processing method applied to an electronic device, the method comprising:
determining an episode number and a position of the episode number contained in the script to be processed according to a preset episode number expression range, and dividing the script to be processed into a plurality of episodes according to the episode number and the position of the episode number;
for at least one episode, determining a scene number and a position of the scene number contained in the episode according to a preset scene number expression range, and dividing the episode into a plurality of scene texts according to the scene number and the position of the scene number;
for at least one scene text, extracting scene information characters contained in the scene text;
determining the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as information to be sorted of the scene text;
and combining the information to be sorted of the scene text and the body text in the scene text according to a preset form to form a target script.
2. The method according to claim 1, wherein said extracting, for at least one of the scene texts, a scene information character contained in the scene text comprises:
traversing the scene text according to a preset scene information cue word range value to determine whether the scene text contains a scene information cue word;
if the scene text contains scene information cue words, determining characters adjacent to the scene information cue words as scene information characters, and extracting the characters;
if the scene text does not contain scene information cue words, dividing the scene text into a plurality of sub-texts; and extracting scene information characters from the sub-texts according to preset scene information character range values and/or the part of speech of the words in the sub-texts.
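A minimal sketch of this two-stage extraction follows. The cue-word list ("时间:", "地点:", "天气:") and the small known-character range used in the fallback are assumed values, standing in for the preset ranges the claim refers to.

```python
import re

# Assumed preset values (placeholders for the claim's preset ranges).
CUE_WORDS = ["时间:", "地点:", "天气:"]
KNOWN_INFO_CHARS = {"白天", "夜晚", "清晨", "黄昏"}

def extract_scene_info(scene_text):
    info = []
    # Stage 1: traverse the scene text looking for cue words; the characters
    # adjacent to a cue word (up to the end of the line) are extracted as
    # scene information characters.
    for cue in CUE_WORDS:
        idx = scene_text.find(cue)
        if idx != -1:
            value = scene_text[idx + len(cue):].split("\n", 1)[0].strip()
            info.append((cue.rstrip(":"), value))
    if info:
        return info
    # Stage 2: no cue words - divide the heading line into sub-texts and
    # keep those that fall inside the preset scene information character range.
    subs = re.split(r"[\s,，、]+", scene_text.split("\n", 1)[0])
    return [("时间", s) for s in subs if s in KNOWN_INFO_CHARS]
```

The fallback only fires when no cue word is present, matching the claim's if/else structure.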
3. The method according to claim 2, wherein the extracting scene information characters from the sub-text according to preset scene information character range values and/or parts of speech of words in the sub-text comprises:
determining the part of speech of the words contained in the sub-text; determining words whose part of speech is a preset part of speech and which contain specific characters as scene information characters, and extracting the words;
and/or traversing the sub-text according to a preset scene information character range value to determine whether the sub-text contains characters existing in the preset scene information character range value; if yes, determining the character as a scene information character, and extracting;
wherein the scene information character range value comprises at least one of the following: commonly used characters representing time, characters representing place, characters representing weather, and characters representing a person's name.
4. The method according to claim 1, wherein the determining the episode number included in the scenario to be processed and the position of the episode number according to a preset episode number expression range comprises:
generating an episode number regular expression according to the preset episode number expression range; the episode number regular expression defines the episode numbers contained in the episode number expression range;
and performing regular matching on the scenario to be processed by using the episode number regular expression, and determining the episode number and the position of the episode number of each episode contained in the scenario to be processed.
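Claims 4 and 5 describe the same mechanism at episode level and scene level; it can be sketched for episode numbers as follows. The expression forms ("第N集" with Arabic or Chinese numerals, "EPN") are illustrative assumptions about what a preset episode number expression range might contain.

```python
import re

# Assumed expression forms making up a preset episode number expression
# range; the alternation regex is generated from the range.
EPISODE_FORMS = [r"第(\d+)集", r"第([一二三四五六七八九十百]+)集", r"EP(\d+)"]
EPISODE_RE = re.compile("|".join(EPISODE_FORMS))

def find_episode_marks(scenario):
    """Regular-match the scenario and return each episode number together
    with its position in the text."""
    marks = []
    for m in EPISODE_RE.finditer(scenario):
        # Exactly one alternative matched, so exactly one group is non-None.
        number = next(g for g in m.groups() if g is not None)
        marks.append((number, m.start()))
    return marks
```

The returned positions are what the dividing step uses as chunk boundaries.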
5. The method according to claim 1, wherein the determining the scene number and the position of the scene number included in the episode according to a preset scene number expression range comprises:
generating a scene number regular expression according to the preset scene number expression range; scene numbers contained in the scene number expression range are defined in the scene number regular expression;
and performing regular matching on the episode by using the scene number regular expression, and determining the scene number of each scene contained in the episode and the position of the scene number.
6. The method of claim 1, further comprising:
performing a preprocessing operation on the script to be processed;
wherein the preprocessing operation comprises at least one of the following operations:
deleting the interference information in the scenario to be processed;
converting the fonts in the script to be processed into preset fonts;
converting punctuation marks in the script to be processed into punctuation marks corresponding to the punctuation marks in the symbol mapping relation according to a preset symbol mapping relation;
and deleting punctuation marks which do not belong to the available punctuation mark range value in the script to be processed according to a preset available punctuation mark range value.
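The two punctuation steps of this preprocessing can be sketched as follows. The symbol mapping (full-width to half-width) and the available punctuation range are assumed values; the patent leaves the preset values open.

```python
import unicodedata

# Assumed preset values: a symbol mapping relation and an available
# punctuation mark range value.
SYMBOL_MAP = {"：": ":", "，": ",", "。": ".", "（": "(", "）": ")"}
AVAILABLE = set(':,.()、?!"\'')

def preprocess(scenario):
    # Convert each punctuation mark to its counterpart in the symbol mapping.
    mapped = "".join(SYMBOL_MAP.get(ch, ch) for ch in scenario)
    kept = []
    for ch in mapped:
        # Delete punctuation marks that do not belong to the available range.
        if unicodedata.category(ch).startswith("P") and ch not in AVAILABLE:
            continue
        kept.append(ch)
    return "".join(kept)
```

Normalizing punctuation before the regex matching steps keeps a single expression range sufficient regardless of whether the source used full-width or half-width marks.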
7. The method according to claim 1, wherein the combining the information to be collated of the scene text and the body text in the scene text according to a preset form comprises:
setting, for the information to be sorted, an information category identifier corresponding to the information category to which the information to be sorted belongs, and setting a text identifier for the body text;
and combining the information to be sorted carrying the information category identifier with the text carrying the text identifier.
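This combination step can be sketched as follows. The bracketed identifiers ([time], [place], [body]) are hypothetical names chosen only to illustrate setting category and text identifiers before combining; the patent's preset form is not specified.

```python
def combine(info_to_sort, body_text):
    """Prefix each information item with its category identifier and the
    body with a text identifier, then join them in one fixed (preset) form."""
    lines = [f"[{category}] {value}" for category, value in info_to_sort]
    lines.append(f"[body] {body_text}")
    return "\n".join(lines)
```

Because every scene text passes through the same `combine` step, the internal form of all scene texts in the target scenario comes out consistent.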
8. A scenario processing apparatus applied to an electronic device, the apparatus comprising:
the first determining module is used for determining the episode number and the position of the episode number contained in the scenario to be processed according to a preset episode number expression range, and dividing the scenario to be processed into a plurality of episodes according to the episode number and the position of the episode number;
the second determining module is used for determining scene numbers and positions of the scene numbers contained in at least one episode according to a preset scene number expression range, and dividing the episode into a plurality of scene texts according to the scene numbers and the positions of the scene numbers;
the extraction module is used for extracting scene information characters contained in the scene text for at least one scene text;
a third determining module, configured to determine, as the information to be sorted of the scene text, the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs;
and the combination module is used for combining the information to be sorted of the scene text and the text in the scene text according to a preset form to form the target script.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of any one of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202010136869.4A 2020-03-02 2020-03-02 Scenario processing method and device, electronic equipment and computer readable storage medium Active CN111291535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010136869.4A CN111291535B (en) 2020-03-02 2020-03-02 Scenario processing method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111291535A true CN111291535A (en) 2020-06-16
CN111291535B CN111291535B (en) 2024-06-11

Family

ID=71030213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010136869.4A Active CN111291535B (en) 2020-03-02 2020-03-02 Scenario processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111291535B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063481A (en) * 2010-12-24 2011-05-18 中国电子科技集团公司第五十四研究所 Method for establishing movie and TV drama analysis dedicated knowledge base and method for analyzing drama
US9106812B1 (en) * 2011-12-29 2015-08-11 Amazon Technologies, Inc. Automated creation of storyboards from screenplays
CN107368965A (en) * 2017-07-18 2017-11-21 杭州火剧科技有限公司 A kind of script data processing method, device and apply its computer equipment
CN107977359A (en) * 2017-11-27 2018-05-01 西安影视数据评估中心有限公司 A kind of extracting method of video display drama scene information
CN110287376A (en) * 2019-06-11 2019-09-27 天津大学 A method of the important vidclip of extraction based on drama and caption analysis

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832280A (en) * 2020-07-09 2020-10-27 北京奇艺世纪科技有限公司 Script information processing method and device, electronic equipment and storage medium
CN111832280B (en) * 2020-07-09 2023-06-30 北京奇艺世纪科技有限公司 Scenario information processing method and device, electronic equipment and storage medium
CN113342829A (en) * 2021-07-08 2021-09-03 北京海马轻帆娱乐科技有限公司 Script processing method and device, electronic equipment and computer storage medium

Similar Documents

Publication Publication Date Title
CN108376151B (en) Question classification method and device, computer equipment and storage medium
US7983903B2 (en) Mining bilingual dictionaries from monolingual web pages
Hochberg et al. Script and language identification for handwritten document images
CN113158653B (en) Training method, application method, device and equipment for pre-training language model
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
JP2020027649A (en) Method, apparatus, device and storage medium for generating entity relationship data
CN103678684A (en) Chinese word segmentation method based on navigation information retrieval
CN111274239A (en) Test paper structuralization processing method, device and equipment
CN110457715B (en) Method for processing out-of-set words of Hanyue neural machine translation fused into classification dictionary
CN109165373B (en) Data processing method and device
CN112149386A (en) Event extraction method, storage medium and server
CN111782892B (en) Similar character recognition method, device, apparatus and storage medium based on prefix tree
CN111291535B (en) Scenario processing method and device, electronic equipment and computer readable storage medium
CN115168345B (en) Database classification method, system, device and storage medium
CN112395392A (en) Intention identification method and device and readable storage medium
CN112199499A (en) Text division method, text classification method, device, equipment and storage medium
WO2022143608A1 (en) Language labeling method and apparatus, and computer device and storage medium
CN110737770B (en) Text data sensitivity identification method and device, electronic equipment and storage medium
CN111539383B (en) Formula knowledge point identification method and device
CN111046627A (en) Chinese character display method and system
CN113779218B (en) Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium
CN113609864B (en) Text semantic recognition processing system and method based on industrial control system
CN115050025A (en) Knowledge point extraction method and device based on formula recognition
KR101126186B1 (en) Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof
CN115687334B (en) Data quality inspection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant