CN112995719A - Bullet screen text-based problem set acquisition method and device and computer equipment - Google Patents

Bullet screen text-based problem set acquisition method and device and computer equipment

Info

Publication number
CN112995719A
CN112995719A CN202110430212.3A CN202110430212A CN112995719A CN 112995719 A CN112995719 A CN 112995719A CN 202110430212 A CN202110430212 A CN 202110430212A CN 112995719 A CN112995719 A CN 112995719A
Authority
CN
China
Prior art keywords
target
bullet screen
text
time
screen text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110430212.3A
Other languages
Chinese (zh)
Other versions
CN112995719B (en)
Inventor
许丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110430212.3A
Publication of CN112995719A
Application granted
Publication of CN112995719B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 Electrically-operated educational appliances
    • G09B5/06 Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G09B5/065 Combinations of audio and video presentations, e.g. videotapes, videodiscs, television systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/478 Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788 Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services, e.g. news ticker for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a bullet screen text-based question set acquisition method and device, computer equipment, and a storage medium, and relates to artificial intelligence technology. The method extracts the core question set from a video's bullet screen text using natural language processing and time-series inflection point detection, so that the core question set can be obtained quickly without the user reviewing the whole video, which improves the efficiency of extracting and locating hot-spot question bullet screen text.

Description

Bullet screen text-based problem set acquisition method and device and computer equipment
Technical Field
The invention relates to the technical field of intelligent decision making in artificial intelligence, and in particular to a bullet screen text-based problem set acquisition method and device, computer equipment, and a storage medium.
Background
Traditional offline teaching naturally helps both teacher and learner establish a positive cycle of output, feedback, and optimization. However, with the development of internet technology, more and more teaching content is moving from offline to online. In enterprise training in particular, video training is taking on a growing share in order to overcome limitations of time and location.
Bullet screen information is a valuable data resource in a training scenario and contains a large amount of instant classroom feedback, evaluation, questions, and the like. At present, video platforms, and educational platforms in particular, still make limited use of bullet screens. In a live course, the instructor can follow bullet screen messages in real time and answer questions specifically, whereas bullet screen information in a recorded course is generally difficult to use fully.
In a recorded course, bullet screen content generally lacks context. Even if the bullet screen list is saved, the instructor cannot easily match a message to the content being explained and can only find the bullet screen comments for a given explanation by reviewing the video again, so hot-spot bullet screen text is extracted and located inefficiently.
Disclosure of Invention
The embodiments of the invention provide a bullet screen text-based question set acquisition method and device, computer equipment, and a storage medium, aiming to solve the prior-art problem that, in recorded video courses, the bullet screen comments corresponding to the explained content can only be obtained by reviewing the video again after the bullet screen content is saved, so that hot-spot question bullet screen text is extracted and located inefficiently.
In a first aspect, an embodiment of the present invention provides a method for acquiring a question set based on a bullet screen text, including:
acquiring a bullet screen text data set of the selected target video data in the last bullet screen acquisition period;
inputting each bullet screen text data in the bullet screen text data set to a pre-trained text sentence type identification model to obtain a text sentence type corresponding to each bullet screen text data, and acquiring bullet screen text data of which the text sentence type in the bullet screen text data set is a question sentence to form a target bullet screen text data set;
according to the bullet screen sending time corresponding to each piece of target bullet screen text data in the target bullet screen text data set and a plurality of time segments corresponding to the target video data, counting the number of target bullet screen text data corresponding to each time segment according to the ascending time sequence to form a problem number time sequence;
performing inflection point detection on the problem quantity time series to obtain an inflection point detection result set;
acquiring rising inflection points in the inflection point detection result set and time division sections corresponding to the rising inflection points, and combining the time division sections corresponding to the rising inflection points to obtain a target time division section set;
respectively performing text clustering on target bullet screen text data subsets corresponding to each target time partition in the target time partition set to obtain a text clustering result corresponding to each target time partition;
acquiring clustering texts of which the descending ranking of the text clustering quantity in each text clustering result does not exceed a preset ranking threshold value, and forming target clustering text subsets respectively corresponding to each text clustering result; and
acquiring a time period, time period video data and a target clustering text subset corresponding to each target time division segment, forming a mixed data set corresponding to each target time division segment, and sending the mixed data set to a target user side.
In a second aspect, an embodiment of the present invention provides a problem set obtaining apparatus based on a bullet screen text, including:
the bullet screen data set acquisition unit is used for acquiring a bullet screen text data set of the selected target video data in the last bullet screen acquisition period;
a target barrage text acquiring unit, configured to input each barrage text data in the barrage text data set to a pre-trained text sentence type identification model, obtain a text sentence type corresponding to each barrage text data, and acquire barrage text data in the barrage text data set, where the text sentence type is a question sentence, to form a target barrage text data set;
a problem quantity time sequence obtaining unit, configured to count, according to the bullet screen sending time corresponding to each piece of target bullet screen text data in the target bullet screen text data set and a plurality of time segments corresponding to the target video data, the quantity of target bullet screen text data corresponding to each time segment in an ascending time order, and form a problem quantity time sequence;
the inflection point detection unit is used for carrying out inflection point detection on the problem quantity time sequence to obtain an inflection point detection result set;
a target time partition set obtaining unit, configured to obtain rising inflection points in the inflection point detection result set and a time partition corresponding to each rising inflection point, and combine the time partitions corresponding to each rising inflection point to obtain a target time partition set;
the text clustering result acquisition unit is used for respectively carrying out text clustering on the target bullet screen text data subsets corresponding to each target time division segment in the target time division segment set to obtain a text clustering result corresponding to each target time division segment;
the target clustering text subset acquisition unit is used for acquiring clustering texts of which the descending ranking of the text clustering quantity in each text clustering result does not exceed the corresponding ranking threshold value, and forming target clustering text subsets respectively corresponding to each text clustering result; and
the mixed data set acquisition unit is used for acquiring the time period, the time period video data and the target clustering text subset corresponding to each target time division segment, forming a mixed data set corresponding to each target time division segment, and sending the mixed data set to the target user side.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for acquiring a question set based on a bullet screen text according to the first aspect when executing the computer program.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the method for acquiring a question set based on a bullet screen text according to the first aspect.
The embodiments of the invention provide a bullet screen text-based problem set acquisition method and device, computer equipment, and a storage medium. The method extracts the core question set from a video's bullet screen text using natural language processing and time-series inflection point detection, so that the core question set can be obtained quickly without the user reviewing the whole video, which improves the efficiency of extracting and locating hot-spot question bullet screen text.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a question set obtaining method based on a bullet screen text according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a problem set obtaining method based on a bullet screen text according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a problem set obtaining apparatus based on bullet screen text according to an embodiment of the present invention;
Fig. 4 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a bullet screen text-based problem set acquisition method according to an embodiment of the present invention; fig. 2 is a schematic flowchart of a method for acquiring a question set based on a bullet screen text according to an embodiment of the present invention, where the method is applied to a server and is executed by application software installed in the server.
As shown in FIG. 2, the method includes steps S101 to S108.
S101, acquiring a bullet screen text data set of the selected target video data in the last bullet screen acquisition period.
In this embodiment, to make the technical solution of the present application easier to understand, the terminals involved are described in detail below. The technical solution is described from the perspective of the server.
First, the server stores a large amount of recorded video data (generally training-type teaching videos, each corresponding to one instructor). After logging in to the server, a user can select one or more recorded videos to watch and can send bullet screen text at any point while watching. After receiving a large number of bullet screen texts sent by users for a given recorded video, the server can extract question-type bullet screen texts from the sentence type and content of each text by means of natural language processing (NLP) and time-series inflection point detection. The instructor of that recorded video then provides supplementary explanations in various forms for the questions raised, generates links to the supplementary material, and adds the links to the video frames of the corresponding time periods. In this way a feedback loop is established between teachers and students.
Second, a plurality of clients can establish communication connections with the server and watch the recorded video data online. For example, after logging in to the server, client A can select one or more recorded videos to watch and can send bullet screen text at any point while watching.
In a specific implementation, the target video data is recorded video data. A user can open recorded video data at any time to watch it, and while watching can edit and send bullet screen text at any point in time. The text is displayed on the display interface of the recorded video and moves from one side of the interface to the other at a certain speed until it is no longer displayed.
Multiple users can send bullet screen text for the same recorded video, so the server can collect all bullet screen texts sent by multiple users within a period of time. A recorded video for which multiple users have sent bullet screen text can be understood as the target video data selected by the server for analysis, and the last bullet screen collection period can be understood as, for example, the last natural week.
S102, inputting each bullet screen text data in the bullet screen text data set to a pre-trained text sentence type recognition model to obtain a text sentence type corresponding to each bullet screen text data, and obtaining bullet screen text data of which the text sentence type in the bullet screen text data set is a question sentence to form a target bullet screen text data set.
In this embodiment, after the server collects all bullet screen texts of the target video data in the last bullet screen collection period, a bullet screen text data set is formed. In this data set, each piece of bullet screen text data includes at least the following attributes: first, the content of the bullet screen text (for example, "How should knowledge point A be understood?"); second, the bullet screen sending time relative to the video playback timeline (for example, the text above was sent when playback of the target video data reached the 8th minute); and third, the actual sending time of the bullet screen text in system time (for example, the text above was sent at 8:00 on December 1, 2018).
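The three attributes and the collection period can be sketched as a small record type plus a period filter. This is a minimal illustration only: the names `BulletComment`, `last_natural_week`, and `in_period` are hypothetical, and treating a "natural week" as a Monday-to-Sunday calendar week is an assumption not stated in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BulletComment:
    """One bullet screen text record (hypothetical field names)."""
    text: str              # content of the bullet screen text
    video_offset_s: float  # sending time on the video playback timeline, in seconds
    sent_at: datetime      # actual sending time in system time


def last_natural_week(now: datetime) -> tuple:
    """Return [start, end) of the last full calendar week before `now`.

    Assumption: a natural week runs Monday 00:00 to the next Monday 00:00.
    """
    this_monday = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return this_monday - timedelta(days=7), this_monday


def in_period(comments, start: datetime, end: datetime):
    """Keep only the comments whose system sending time falls in [start, end)."""
    return [c for c in comments if start <= c.sent_at < end]


comments = [
    BulletComment("How should knowledge point A be understood?", 480.0,
                  datetime(2018, 12, 1, 8, 0)),
    BulletComment("Great lecture", 30.0, datetime(2018, 12, 4, 9, 0)),
]
start, end = last_natural_week(datetime(2018, 12, 5, 12, 0))
collected = in_period(comments, start, end)
```

Keeping both timestamps separate matters later: the playback offset drives the per-segment counting, while the system time only decides whether a comment belongs to the collection period.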
In an embodiment, the text statement type recognition model is a support vector machine classification model, and step S102 further includes:
acquiring a historical bullet screen text set as a sample set;
obtaining a sentence vector corresponding to each historical bullet screen text in the historical bullet screen text set;
acquiring a text statement type marking value corresponding to each historical bullet screen text; wherein, the bullet screen text label value of the text statement type of question sentence is 1, and the bullet screen text label value of the text statement type of non-question sentence is 0;
and taking the sentence vector corresponding to each historical bullet screen text as the input of the classification model of the support vector machine to be trained, taking the label value corresponding to the sentence vector as the output of the classification model of the support vector machine to be trained, training the classification model of the support vector machine to be trained to obtain the classification model of the support vector machine, and acquiring the classification hyperplane corresponding to the classification model of the support vector machine.
In this embodiment, a text sentence type recognition model may be trained in advance on the server to recognize the sentence type of each piece of bullet screen text data in the bullet screen text data set. The simplest implementation only determines whether a piece of bullet screen text data is a question sentence or a non-question sentence, that is, it uses a binary classification model such as a support vector machine (SVM) classifier. In a specific implementation, a batch of historical bullet screen texts can be labeled in advance and segmented into words, and a bag-of-words model is used to construct the sentence vectors on which the SVM classification model is trained.
After the bullet screen texts of the question types in the bullet screen text data sets are identified, the bullet screen texts of the question types form a target bullet screen text data set, namely, the question type bullet screen texts are screened and reserved.
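The training and screening flow of step S102 can be illustrated with a toy example. The sketch below is not the patented model: it assumes whitespace tokenization in place of a real word segmenter, uses made-up English training texts, and trains a bias-free linear SVM by sub-gradient descent on the hinge loss rather than calling an SVM library. All function names and hyperparameters are hypothetical.

```python
from collections import Counter


def build_vocab(tokenized_texts):
    """Assign each distinct token an index in the bag-of-words vector."""
    vocab = {}
    for tokens in tokenized_texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def bow_vector(tokens, vocab):
    """Bag-of-words sentence vector: per-token counts, unknown tokens ignored."""
    vec = [0.0] * len(vocab)
    for tok, n in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = float(n)
    return vec


def train_linear_svm(X, y, epochs=100, lr=0.1, lam=0.01):
    """Batch sub-gradient descent on L2-regularized hinge loss (no bias term)."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            sign = 1.0 if yi == 1 else -1.0  # label 1 = question, 0 = non-question
            margin = sign * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # sample violates the margin
                for j, xj in enumerate(xi):
                    grad[j] -= sign * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical labeled historical bullet screen texts (1 = question sentence).
history = [
    ("how do i understand knowledge point a", 1),
    ("why does that step hold", 1),
    ("what does that formula mean", 1),
    ("great lecture thanks", 0),
    ("very clear explanation", 0),
    ("nice example teacher", 0),
]
tokenized = [text.split() for text, _ in history]
vocab = build_vocab(tokenized)
weights = train_linear_svm([bow_vector(t, vocab) for t in tokenized],
                           [label for _, label in history])


def is_question(text):
    """Return 1 if the trained classifier scores the text as a question, else 0."""
    x = bow_vector(text.split(), vocab)
    return 1 if sum(wj * xj for wj, xj in zip(weights, x)) > 0 else 0


target_set = [t for t in ["why does that formula hold", "great explanation thanks"]
              if is_question(t) == 1]
```

In practice the sentence vectors would come from segmented Chinese text and the classifier from an established SVM implementation; the sketch only shows how labeled sentence vectors lead to a separating hyperplane that is then used to screen question-type texts.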
S103, according to the bullet screen sending time corresponding to each piece of target bullet screen text data in the target bullet screen text data set and a plurality of time division sections corresponding to the target video data, counting the number of the target bullet screen text data corresponding to each time division section according to the ascending order of time to form a problem number time sequence.
In this embodiment, each piece of target bullet screen text data in the target bullet screen text data set carries the bullet screen sending time as a data attribute. The video playback timeline of the target video data can therefore be divided into time periods, and the number of target bullet screen text data in each time division segment is counted in ascending time order to form a problem quantity time series. For example, a time window value (e.g., 3-5 s) may be preset to divide the target video duration into a plurality of time division segments. Once the per-segment bullet screen counts are arranged as a problem quantity time series, the time periods with more questions can be identified and addressed with targeted solutions.
In one embodiment, step S103 includes:
dividing the target video time length corresponding to the target video data according to a preset time window value to obtain a time division segment set corresponding to the target video time length;
counting and acquiring a target bullet screen text data subset corresponding to each time partition in the time partition set according to bullet screen sending time corresponding to each target bullet screen text data in the target bullet screen text data set;
and counting the number of the bullet screens corresponding to each target bullet screen text data subset in sequence according to the time ascending sequence of the time partition corresponding to each target bullet screen text data subset to form a problem number time sequence.
In this embodiment, the time window value may be set to 3-5 s. For example, if the time window value is 4 s and the target video duration is 1800 s, the duration is divided into 1800/4 = 450 time division segments, that is, the time division segment set corresponding to the target video duration contains 450 time division segments. Because each piece of target bullet screen text data in the target bullet screen text data set has a bullet screen sending time attribute, the target bullet screen text data subset for each time division segment can be obtained by grouping on that attribute in ascending time order of the segments. Finally, the number of bullet screens in each target bullet screen text data subset is counted in sequence, and these counts, arranged in ascending time order, form the problem quantity time series. Dividing in this way splits target video data with a long duration at a finer granularity, which allows more accurate data analysis.
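The division and counting described above can be sketched as follows. The function name and parameter names are hypothetical, and rounding a partial final window up to a full segment is an assumption the text does not address.

```python
import math


def count_questions_per_segment(send_times_s, video_duration_s, window_s):
    """Count question bullet screens in each fixed-width time division segment.

    send_times_s: sending times on the playback timeline, in seconds.
    Returns the counts in ascending segment order (the problem quantity time series).
    """
    n_segments = math.ceil(video_duration_s / window_s)
    counts = [0] * n_segments
    for t in send_times_s:
        if 0 <= t <= video_duration_s:
            # A comment sent exactly at the end of the video goes to the last segment.
            idx = min(int(t // window_s), n_segments - 1)
            counts[idx] += 1
    return counts


# An 1800 s video with a 4 s window yields 450 segments.
series = count_questions_per_segment([0.5, 3.9, 4.0, 9.2], 1800, 4)
```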
And S104, performing inflection point detection on the problem quantity time series to obtain an inflection point detection result set.
In this embodiment, a variety of mature methods are available for inflection point detection on the problem quantity time series. For example, the Binary Segmentation method may be adopted with a linear loss function. Because the number of inflection points is not fixed, a penalty term needs to be added to the loss function; for example, an L_0 penalty term can be chosen to balance model complexity against goodness of fit across different numbers of inflection points.
In one embodiment, step S104 includes:
calling a prestored bisection segmentation model, and acquiring a loss function corresponding to the bisection segmentation model;
and obtaining a target problem quantity value which meets the condition that the total loss corresponding to all the problem quantity values is the minimum value in the problem quantity time sequence through a binary segmentation model, and forming an inflection point detection result set.
In the present embodiment, the binary segmentation model corresponds to a sequential greedy algorithm: each iteration detects a single change point and produces an estimate. For example, the first estimated inflection point $\hat{t}^{(1)}$ is:
$$\hat{t}^{(1)} = \underset{1 \le t < T}{\arg\min} \left[ c(y_{0..t}) + c(y_{t..T}) \right]$$
where $c(\cdot)$ is the loss function, $y_{0..t}$ represents the time series between time point 0 and time point $t$, $y_{t..T}$ represents the time series between time point $t$ and time point $T$, and $\arg\min$ returns the value of $t$ that minimizes the bracketed expression. That is, one estimated inflection point is obtained after the first binary segmentation; the same procedure is then iterated up to the Kth time (K being a preset expected number of iterations), and the binary segmentation model yields the target problem quantity values for which the total loss over all problem quantity values in the problem quantity time series is minimal, forming the inflection point detection result set.
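A greedy binary segmentation of this kind can be sketched in a few lines. The sketch makes several assumptions that are not in the text: the segment loss is the sum of squared deviations from the segment mean (the embodiment mentions a linear loss function without defining it), the penalty is a fixed cost per added change point, and all names are hypothetical.

```python
def segment_cost(values):
    """Assumed loss c(.): sum of squared deviations from the segment mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(series, start, end):
    """Find t in (start, end) minimizing c(series[start:t]) + c(series[t:end])."""
    best_t, best_cost = None, float("inf")
    for t in range(start + 1, end):
        cost = segment_cost(series[start:t]) + segment_cost(series[t:end])
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


def binary_segmentation(series, max_points, penalty):
    """Greedy detection: add one change point per iteration, at most max_points.

    A split is accepted only if it lowers the total loss by more than `penalty`,
    which plays the role of a per-change-point (L_0-style) penalty.
    """
    segments = [(0, len(series))]
    change_points = []
    for _ in range(max_points):
        best = None  # (gain, split index, segment)
        for start, end in segments:
            t, split_cost = best_split(series, start, end)
            if t is None:
                continue
            gain = segment_cost(series[start:end]) - split_cost
            if best is None or gain > best[0]:
                best = (gain, t, (start, end))
        if best is None or best[0] <= penalty:
            break
        _, t, (start, end) = best
        segments.remove((start, end))
        segments += [(start, t), (t, end)]
        change_points.append(t)
    return sorted(change_points)


counts = [1, 1, 1, 1, 9, 9, 9, 9, 2, 2, 2, 2]
detected = binary_segmentation(counts, max_points=5, penalty=1.0)
```

On this toy series the first iteration splits at index 4 and the second at index 8; a third split would bring no gain, so the loop stops before reaching `max_points`.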
And S105, acquiring rising inflection points in the inflection point detection result set and time division sections corresponding to the rising inflection points, and combining the time division sections corresponding to the rising inflection points to obtain a target time division section set.
In this embodiment, rising inflection points can be identified through inflection point detection (the problem quantity value at a rising inflection point is generally the maximum within some interval of the sequence). Each rising inflection point corresponds to one time partition; after the time partitions corresponding to all rising inflection points are found, they together form the target time partition set, and every time partition in that set can be understood as a period in which questions are concentrated.
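As a rough sketch of this selection step, the snippet below keeps only the inflection points at which the question count rises and maps each one to its time partition. The rule used to decide that a point is "rising" (the count at the point exceeds the count just before it) is an assumption made for illustration, as are the names `rising_partitions` and `window_seconds`.

```python
from typing import List, Tuple


def rising_partitions(counts: List[int], inflection_points: List[int],
                      window_seconds: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of the partitions at rising inflection points."""
    targets: List[Tuple[float, float]] = []
    for t in sorted(set(inflection_points)):
        # Assumed rule: an inflection point is rising if the count jumps upward at t.
        if 0 < t < len(counts) and counts[t] > counts[t - 1]:
            targets.append((t * window_seconds, (t + 1) * window_seconds))
    return targets
```

For instance, with counts `[1, 1, 1, 1, 9, 9, 9, 9, 2, 2, 2, 2]`, inflection points `[4, 8]` and a 4 s window, only index 4 is kept, because the count falls rather than rises at index 8.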
And S106, respectively carrying out text clustering on the target bullet screen text data subsets corresponding to each target time partition in the target time partition set to obtain a text clustering result corresponding to each target time partition.
In this embodiment, cluster analysis is performed separately on the question content of each target time partition in which questions are highly concentrated. In this step, to avoid feeding repeated questions about the same time partition back to the instructor, bullet screen texts with similar semantics need to be merged (because different wordings in bullet screen texts can carry the same meaning). The bullet screen texts to be clustered are first segmented into words; the words are then vectorized by embedding, the word vectors are combined into sentence vectors, and a Gaussian mixture model (GMM) is used to cluster the texts, thereby finding the core question set for each question hotspot time partition.
In one embodiment, step S106 includes:
acquiring a target bullet screen text included in an ith group of target bullet screen text data subsets in an ith target time division segment; wherein the initial value of i is 1, the value range of i is [1, k ], and the value of k is equal to the total number of the target time division sections included in the target time division section set;
obtaining sentence vectors respectively corresponding to target bullet screen texts included in the ith group of target bullet screen text data subsets;
clustering each sentence vector corresponding to the ith group of target bullet screen text data subsets according to a pre-trained Gaussian mixture model to obtain an ith group of text clustering results corresponding to the ith group of target bullet screen text data subsets;
increasing the value of i by 1 to update i, and judging whether i exceeds k; if i does not exceed k, returning to execute the step of obtaining the target bullet screen text included in the ith group of target bullet screen text data subset in the ith target time division segment;
if i exceeds k, the process ends.
In this embodiment, after the k groups of target bullet screen text data subsets are obtained, text clustering is performed on each group in sequence. Taking the 1st group of target bullet screen text data subsets in the 1st target time partition as an example, each target bullet screen text in the group is first converted into a sentence vector (converting text into sentence vectors is prior art and is not described further here), and the sentence vectors corresponding to the target bullet screen texts are then input into the Gaussian mixture model for clustering to obtain the 1st group of text clustering results.
A Gaussian mixture model quantifies objects precisely using Gaussian probability density functions. It is generally used where the data in one set come from several different distributions, with the data belonging to the same distribution corresponding to the same Gaussian probability density function. Using a Gaussian mixture model for text clustering is similar to the K-means clustering approach.
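For reference, the standard form of a Gaussian mixture over sentence vectors x can be written as below. The symbols M (number of components), π_m (mixing weights), μ_m and Σ_m (component means and covariances) and γ_m (responsibilities) are conventional notation assumed here, not taken from the original text:

```latex
p(x) = \sum_{m=1}^{M} \pi_m \, \mathcal{N}\left(x \mid \mu_m, \Sigma_m\right),
\qquad \sum_{m=1}^{M} \pi_m = 1,
\qquad
\gamma_m(x) = \frac{\pi_m \, \mathcal{N}\left(x \mid \mu_m, \Sigma_m\right)}
                   {\sum_{j=1}^{M} \pi_j \, \mathcal{N}\left(x \mid \mu_j, \Sigma_j\right)}
```

A sentence vector can then be assigned to the component with the largest responsibility γ_m(x). Unlike K-means, which makes hard assignments based on distance to a centroid, the mixture model yields a probability of membership for every component.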
S107, obtaining the clustering texts of which the descending ranking of the text clustering quantity in each text clustering result does not exceed the corresponding ranking threshold value, and forming target clustering text subsets respectively corresponding to each text clustering result.
In this embodiment, when the core problem set of each question hotspot time partition is screened, the clustered texts of the three largest clusters (ranked by the number of texts they contain) in each question hotspot time partition may be selected to form the core problem set corresponding to that partition. In this way, the number of problems is effectively reduced, so that the final result focuses on the core problem set.
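The screening described above can be sketched as follows, assuming each bullet screen text already has a cluster label from the clustering step. The function name `core_question_set` and the `top_n` parameter (standing in for the ranking threshold, 3 in the example above) are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def core_question_set(texts: List[str], labels: List[int], top_n: int = 3) -> Dict[int, List[str]]:
    """Keep only the texts that belong to the top_n largest clusters."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    sizes = Counter(labels)
    # most_common sorts clusters by size in descending order.
    kept = [label for label, _ in sizes.most_common(top_n)]
    return {
        label: [text for text, assigned in zip(texts, labels) if assigned == label]
        for label in kept
    }
```

The returned dictionary is ordered from the largest cluster to the smallest, so the most frequently asked question group comes first.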
And S108, acquiring a time period, time period video data and a target clustering text subset corresponding to each target time division segment, forming a mixed data set corresponding to each target time division segment, and sending the mixed data set to a target user side.
In this embodiment, the time period, the time period video data, and the target clustered text subset (which may be understood as the core question set) corresponding to each question hotspot time partition may be combined into a mixed data set, which is returned to the target user terminal used by the instructor. The instructor can thus obtain the portions of the lesson that raised the most questions without reviewing the whole video. (If the lecture is not presented as PPT slides or blackboard writing combined with explanation, so that the explained content is not reflected in the picture, the audio for a period before and after the time node can be converted to text and keywords extracted from it; these keywords are then fed back to the instructor in place of the video picture.)
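A minimal sketch of how such a mixed data set might be assembled is given below. The `MixedDataSet` structure, its field names, and the use of a video identifier plus time range to stand for the time period video data are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MixedDataSet:
    time_period: Tuple[float, float]          # (start, end) in seconds on the playback axis
    video_id: str                             # hypothetical reference to the source video
    core_questions: List[str] = field(default_factory=list)


def build_mixed_data_sets(video_id: str,
                          target_partitions: List[Tuple[float, float]],
                          core_question_sets: List[List[str]]) -> List[MixedDataSet]:
    """Pair every target time partition with its core question set."""
    if len(target_partitions) != len(core_question_sets):
        raise ValueError("one core question set is required per target time partition")
    return [
        MixedDataSet(time_period=partition, video_id=video_id, core_questions=list(questions))
        for partition, questions in zip(target_partitions, core_question_sets)
    ]
```

Each resulting record carries what the instructor needs to locate the clip and see the questions asked during it.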
In an embodiment, step S108 is followed by:
and receiving reply data which is sent by a target user side and respectively corresponds to each mixed data set, adding the reply data of each mixed data set to the time period video data of the corresponding target time division segment, and obtaining the reply video data corresponding to the target video data.
In this embodiment, the instructor uploads reply data (which may be understood as supplementary data that can help the user watching the video to answer the question) for each mixed data set.
In an embodiment, adding the reply data of each mixed data set to the time period video data of the corresponding target time division segment to obtain the reply video data corresponding to the target video data includes:
and adding the original text data or hyperlink address of the reply data of each mixed data set to the time period video data of the corresponding target time division segment to obtain the corresponding reply video data.
In this embodiment, the supplementary material uploaded by the target user side may be the original text data or a hyperlink address corresponding to the original text data, so that other users watching the reply video data can either view the supplementary material directly or click the hyperlink address to jump to it. After the instructor uploads the supplementary material through the target user side, the corresponding supplementary material links are displayed in the video picture for a period of time starting from each extracted question hotspot time partition, so that students can focus their study on the current difficulty.
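The display behavior described above could be sketched as follows. The `ReplyData` structure, the `replies_to_display` function and the `display_seconds` parameter are hypothetical, and the rule that a link stays visible for a fixed duration counted from the start of its time partition is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReplyData:
    time_period: Tuple[float, float]  # (start, end) of the question hotspot partition, in seconds
    content: str                      # original text data or a hyperlink address


def replies_to_display(position: float, replies: List[ReplyData],
                       display_seconds: float = 30.0) -> List[ReplyData]:
    """Return the replies whose display window covers the current playback position."""
    return [
        reply for reply in replies
        if reply.time_period[0] <= position < reply.time_period[0] + display_seconds
    ]
```

A player could call `replies_to_display` with the current playback position on each refresh and overlay the returned items on the video picture.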
According to the method, the core problem set in the video bullet screen text is extracted through natural language processing and time series inflection point detection, so that the core problem set in the bullet screen text can be obtained quickly without the user reviewing the whole video, which improves the efficiency of extracting and locating hotspot question bullet screen texts.
The embodiment of the invention also provides a bullet screen text-based problem set acquisition device, which is used for executing any embodiment of the bullet screen text-based problem set acquisition method. Specifically, please refer to fig. 3, fig. 3 is a schematic block diagram of a bullet-screen-text-based problem set acquisition apparatus according to an embodiment of the present invention. The bullet screen text-based problem set acquisition apparatus 100 may be configured in a server.
As shown in fig. 3, the problem set acquisition apparatus 100 based on the bullet screen text includes: a bullet screen data set acquisition unit 101, a target bullet screen text acquisition unit 102, a problem quantity time sequence acquisition unit 103, an inflection point detection unit 104, a target time partition set acquisition unit 105, a text clustering result acquisition unit 106, a target clustered text subset acquisition unit 107, and a mixed data set acquisition unit 108.
A bullet screen data set obtaining unit 101, configured to obtain a bullet screen text data set of the selected target video data in a previous bullet screen collecting period.
In this embodiment, the target video data is, in a specific implementation, recorded (on-demand) video data. Its characteristic is that a user can open and watch it at any time; while watching at a given time point, the user can edit and send a bullet screen text, which is displayed on the display interface of the video and moves from one side of the display interface to the other at a certain speed until it is no longer displayed.
Multiple users can send bullet screen texts for the same recorded video data, so the server can collect all bullet screen texts sent by those users within a period of time (for example, the previous bullet screen collection period can be understood as the previous calendar week). In that case, the recorded video data can be understood as the target video data to be analyzed that the server has selected.
And a target bullet screen text acquiring unit 102, configured to input each piece of bullet screen text data in the bullet screen text data set to a pre-trained text sentence type identification model, obtain a text sentence type corresponding to each piece of bullet screen text data, and acquire the bullet screen text data in the bullet screen text data set whose text sentence type is a question sentence to form a target bullet screen text data set.
In this embodiment, after the server collects all the bullet screen texts of the target video data in the previous bullet screen collection period, a bullet screen text data set is formed. In the bullet screen text data set, each piece of bullet screen text data includes at least the following attributes: first, the bullet screen text itself (for example, a bullet screen text reading "How should knowledge point A be understood?"); second, the bullet screen sending time measured on the video playing time axis (for example, the bullet screen text in the above example was sent when playback of the target video data reached the 8th minute); and third, the actual sending time measured in system time (for example, the bullet screen text in the above example was sent at 8:00 on December 1, 2018).
In an embodiment, the text sentence type recognition model is a support vector machine classification model, and the bullet screen text-based problem set obtaining apparatus 100 further includes:
the sample set acquisition unit is used for acquiring a historical bullet screen text set as a sample set;
a sentence vector obtaining unit, configured to obtain a sentence vector corresponding to each historical bullet screen text in the historical bullet screen text set;
the statement type marking unit is used for acquiring a text statement type label value corresponding to each historical bullet screen text; wherein the label value of a bullet screen text whose text statement type is a question sentence is 1, and the label value of a bullet screen text whose text statement type is a non-question sentence is 0;
and the support vector machine training unit is used for taking the sentence vector corresponding to each historical bullet screen text as the input of the support vector machine classification model to be trained, taking the label value corresponding to the sentence vector as the output of the support vector machine classification model to be trained, training the support vector machine classification model to be trained, obtaining the support vector machine classification model, and obtaining the classification hyperplane corresponding to the support vector machine classification model.
In this embodiment, a text sentence type recognition model may be trained in advance in the server to recognize the sentence type of each piece of bullet screen text data in the bullet screen text data set. The simplest implementation recognizes only whether a piece of bullet screen text data is a question sentence or a non-question sentence, that is, it adopts a binary classification model (for example, a support vector machine classification model). In a specific implementation, a batch of historical bullet screen texts can be labeled in advance and segmented into words, and a bag-of-words model is used to construct the sentence vectors on which the SVM classification model is trained.
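The bag-of-words step mentioned above can be illustrated with the sketch below, which covers only the construction of sentence vectors; training the SVM classifier on these vectors and their 1/0 labels is left out. The sketch assumes the historical bullet screen texts have already been segmented into word lists, and the names `build_vocabulary` and `bag_of_words` are hypothetical.

```python
from typing import Dict, List


def build_vocabulary(tokenized_texts: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct word an index in order of first appearance."""
    vocabulary: Dict[str, int] = {}
    for tokens in tokenized_texts:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def bag_of_words(tokens: List[str], vocabulary: Dict[str, int]) -> List[int]:
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector
```

Each labeled historical text then yields one (vector, label) pair, with label 1 for a question sentence and 0 for a non-question sentence, which is the input format a binary classifier expects.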
After the bullet screen texts of the question types in the bullet screen text data sets are identified, the bullet screen texts of the question types form a target bullet screen text data set, namely, the question type bullet screen texts are screened and reserved.
A problem quantity time sequence obtaining unit 103, configured to count, according to the bullet screen sending time corresponding to each piece of target bullet screen text data in the target bullet screen text data set and a plurality of time segments corresponding to the target video data, the quantity of target bullet screen text data corresponding to each time segment in an ascending time order, and form a problem quantity time sequence.
In this embodiment, each piece of target bullet screen text data included in the target bullet screen text data set carries the bullet screen sending time as a data attribute. Time periods can therefore be divided along the video playing time axis corresponding to the target video data, and the number of target bullet screen text data corresponding to each time division period is then counted in ascending time order to form a problem quantity time series. For example, a time window value (e.g., 3-5 s) may be preset to divide the target video duration corresponding to the target video data into a plurality of time division segments. Once the number of bullet screen texts in each time period has been arranged into a problem quantity time series, it becomes possible to analyze which time periods attract more questions, so that targeted answers can be prepared.
In one embodiment, the problem number time series obtaining unit 103 includes:
the window dividing unit is used for dividing the target video time length corresponding to the target video data according to a preset time window value to obtain a time division segment set corresponding to the target video time length;
a target bullet screen text data subset obtaining unit, configured to statistically obtain a corresponding target bullet screen text data subset in each time partition in the time partition set according to bullet screen sending time corresponding to each target bullet screen text data in the target bullet screen text data set;
and the sequence value acquisition unit is used for counting the bullet screen quantity corresponding to each target bullet screen text data subset in sequence according to the time ascending sequence of the time partition corresponding to each target bullet screen text data subset to form a problem quantity time sequence.
In this embodiment, the time window value may be set to 3-5 s. For example, if the time window value is set to 4 s and the target video duration corresponding to the target video data is 1800 s, the target video duration is divided into 1800/4 = 450 time segments according to the set time window value; that is, the time segment set corresponding to the target video duration includes 450 time segments. Because each piece of target bullet screen text data in the target bullet screen text data set has a bullet screen sending time attribute, the target bullet screen text data subset corresponding to each time partition in the time partition set can be obtained by taking the time partitions in ascending time order and grouping on that attribute. Finally, the number of bullet screens in each target bullet screen text data subset is counted in turn, and these counts, arranged in ascending time order, form the problem quantity time series. Dividing in this way splits target video data of longer duration at a finer granularity, enabling more accurate data analysis.
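The windowing and counting described above can be sketched as follows. The function name `question_count_series` and its parameters are hypothetical; sending times are assumed to be seconds on the video playing time axis, and a final partial window is assumed to count as its own segment.

```python
import math
from typing import List


def question_count_series(send_times: List[float], video_duration: float,
                          window: float) -> List[int]:
    """Count question bullet screens per fixed-length time window, in ascending time order."""
    if window <= 0:
        raise ValueError("window must be positive")
    segment_count = math.ceil(video_duration / window)
    counts = [0] * segment_count
    for sent_at in send_times:
        # Ignore timestamps that fall outside the video.
        if 0 <= sent_at < video_duration:
            counts[int(sent_at // window)] += 1
    return counts
```

The list index is the position of the time segment, so the output can be passed directly to an inflection point detector as the problem quantity time series.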
An inflection point detecting unit 104, configured to perform inflection point detection on the problem quantity time series to obtain an inflection point detection result set.
In this embodiment, a variety of mature methods are available for performing inflection point detection on the problem quantity time series; for example, the Binary Segmentation method may be adopted with a linear loss function. Because the number of inflection points is not fixed in advance, a penalty term needs to be added to the loss function; for example, an L_0 penalty term may be selected to balance model complexity against goodness of fit across different numbers of inflection points.
In one embodiment, the inflection point detecting unit 104 includes:
the model acquisition unit is used for calling a prestored binary segmentation model and acquiring a loss function corresponding to the binary segmentation model;
and the inflection point determining unit is used for obtaining, through the binary segmentation model, the target problem quantity values in the problem quantity time series for which the total loss over all problem quantity values is minimized, to form an inflection point detection result set.
In the present embodiment, the binary segmentation model corresponds to a sequential greedy algorithm: in each iteration, a single change point is detected and an estimate is produced. For example, the first estimated inflection point $\hat{t}^{(1)}$ is:
$$\hat{t}^{(1)} = \arg\min_{t} \; c(y_{0..t}) + c(y_{t..T})$$
where $c(\cdot)$ denotes the loss function, $y_{0..t}$ represents the time series between time point 0 and time point t, $y_{t..T}$ represents the time series between time point t and time point T, and the argmin() function returns the value of t that minimizes the expression in parentheses. That is, one estimated inflection point is obtained after the first binary segmentation; the procedure is then iterated in the same manner up to the K-th time (K being a preset expected number of iterations), and the target problem quantity values in the problem quantity time series for which the total loss over all problem quantity values is minimized are obtained through the binary segmentation model to form an inflection point detection result set.
And a target time partition set obtaining unit 105, configured to obtain the rising inflection points in the inflection point detection result set and the time partition corresponding to each rising inflection point, and combine the time partitions corresponding to each rising inflection point to obtain a target time partition set.
In this embodiment, rising inflection points can be identified through inflection point detection (the problem quantity value at a rising inflection point is generally the maximum within some interval of the sequence). Each rising inflection point corresponds to one time partition; after the time partitions corresponding to all rising inflection points are found, they together form the target time partition set, and every time partition in that set can be understood as a period in which questions are concentrated.
And the text clustering result obtaining unit 106 is configured to perform text clustering on the target bullet screen text data subsets corresponding to each target time partition in the target time partition set, respectively, to obtain a text clustering result corresponding to each target time partition.
In this embodiment, cluster analysis is performed separately on the question content of each target time partition in which questions are highly concentrated. To avoid feeding repeated questions about the same time partition back to the instructor, bullet screen texts with similar semantics need to be merged (because different wordings in bullet screen texts can carry the same meaning). The bullet screen texts to be clustered are first segmented into words; the words are then vectorized by embedding, the word vectors are combined into sentence vectors, and a Gaussian mixture model (GMM) is used to cluster the texts, thereby finding the core question set for each question hotspot time partition.
In an embodiment, the text clustering result obtaining unit 106 is further configured to:
acquiring a target bullet screen text included in an ith group of target bullet screen text data subsets in an ith target time division segment; wherein the initial value of i is 1, the value range of i is [1, k ], and the value of k is equal to the total number of the target time division sections included in the target time division section set;
obtaining sentence vectors respectively corresponding to target bullet screen texts included in the ith group of target bullet screen text data subsets;
clustering each sentence vector corresponding to the ith group of target bullet screen text data subsets according to a pre-trained Gaussian mixture model to obtain an ith group of text clustering results corresponding to the ith group of target bullet screen text data subsets;
increasing the value of i by 1 to update i, and judging whether i exceeds k; if i does not exceed k, returning to execute the step of obtaining the target bullet screen text included in the ith group of target bullet screen text data subset in the ith target time division segment;
if i exceeds k, the process ends.
In this embodiment, after the k groups of target bullet screen text data subsets are obtained, text clustering is performed on each group in sequence. Taking the 1st group of target bullet screen text data subsets in the 1st target time partition as an example, each target bullet screen text in the group is first converted into a sentence vector (converting text into sentence vectors is prior art and is not described further here), and the sentence vectors corresponding to the target bullet screen texts are then input into the Gaussian mixture model for clustering to obtain the 1st group of text clustering results.
A Gaussian mixture model quantifies objects precisely using Gaussian probability density functions. It is generally used where the data in one set come from several different distributions, with the data belonging to the same distribution corresponding to the same Gaussian probability density function. Using a Gaussian mixture model for text clustering is similar to the K-means clustering approach.
And a target clustering text subset obtaining unit 107, configured to obtain clustering texts whose descending ranking of text clustering numbers in each text clustering result does not exceed a preset ranking threshold, and form target clustering text subsets corresponding to each text clustering result.
In this embodiment, when the core problem set of each question hotspot time partition is screened, the clustered texts of the three largest clusters (ranked by the number of texts they contain) in each question hotspot time partition may be selected to form the core problem set corresponding to that partition. In this way, the number of problems is effectively reduced, so that the final result focuses on the core problem set.
And a mixed data set obtaining unit 108, configured to obtain a time period, time period video data, and a target clustering text subset corresponding to each target time partition, to form a mixed data set corresponding to each target time partition, and send the mixed data set to a target user side.
In this embodiment, the time period, the time period video data, and the target clustered text subset (which may be understood as the core question set) corresponding to each question hotspot time partition may be combined into a mixed data set, which is returned to the target user terminal used by the instructor. The instructor can thus obtain the portions of the lesson that raised the most questions without reviewing the whole video. (If the lecture is not presented as PPT slides or blackboard writing combined with explanation, so that the explained content is not reflected in the picture, the audio for a period before and after the time node can be converted to text and keywords extracted from it; these keywords are then fed back to the instructor in place of the video picture.)
In one embodiment, the apparatus 100 for acquiring a question set based on a barrage text further includes:
and the reply data receiving unit is used for receiving reply data which is sent by the target user side and respectively corresponds to each mixed data set, adding the reply data of each mixed data set to the time period video data of the corresponding target time division segment, and obtaining the reply video data corresponding to the target video data.
In this embodiment, the instructor uploads reply data (which may be understood as supplementary data that can help the user watching the video to answer the question) for each mixed data set.
In one embodiment, the reply data receiving unit is further configured to:
and adding the original text data or hyperlink address of the reply data of each mixed data set to the time period video data of the corresponding target time division segment to obtain the corresponding reply video data.
In this embodiment, the supplementary material uploaded by the target user side may be the original text data or a hyperlink address corresponding to the original text data, so that other users watching the reply video data can either view the supplementary material directly or click the hyperlink address to jump to it. After the instructor uploads the supplementary material through the target user side, the corresponding supplementary material links are displayed in the video picture for a period of time starting from each extracted question hotspot time partition, so that students can focus their study on the current difficulty.
The device extracts the core problem set in the video bullet screen text through natural language processing and time series inflection point detection, so that the core problem set in the bullet screen text can be obtained quickly without the user reviewing the whole video, which improves the efficiency of extracting and locating hotspot question bullet screen texts.
The above-mentioned problem set acquisition apparatus based on the bullet screen text can be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 4, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to perform a bullet screen text-based problem set acquisition method.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 may be caused to execute the bullet-screen-text-based problem set acquisition method.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only the portion of the configuration relevant to aspects of the present invention and does not limit the computer device 500 to which aspects of the present invention may be applied; a particular computer device 500 may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the bullet screen text-based problem set obtaining method disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 4 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 4, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the processor 502 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a non-volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the bullet screen text-based problem set acquisition method disclosed by the embodiment of the invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To illustrate this interchangeability of hardware and software clearly, the components and steps of the examples have been described above generally in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in an actual implementation; units having the same function may be grouped into one unit, a plurality of units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A bullet screen text-based problem set acquisition method, characterized by comprising the following steps:
acquiring a bullet screen text data set of the selected target video data in the last bullet screen acquisition period;
inputting each piece of bullet screen text data in the bullet screen text data set to a pre-trained text sentence type identification model to obtain a text sentence type corresponding to each piece of bullet screen text data, and acquiring the bullet screen text data whose text sentence type is a question sentence from the bullet screen text data set to form a target bullet screen text data set;
according to the bullet screen sending time corresponding to each piece of target bullet screen text data in the target bullet screen text data set and a plurality of time segments corresponding to the target video data, counting the number of target bullet screen text data corresponding to each time segment according to the ascending time sequence to form a problem number time sequence;
performing inflection point detection on the problem quantity time series to obtain an inflection point detection result set;
acquiring rising inflection points in the inflection point detection result set and time division sections corresponding to the rising inflection points, and combining the time division sections corresponding to the rising inflection points to obtain a target time division section set;
respectively performing text clustering on target bullet screen text data subsets corresponding to each target time partition in the target time partition set to obtain a text clustering result corresponding to each target time partition;
acquiring clustering texts of which the descending ranking of the text clustering quantity in each text clustering result does not exceed a preset ranking threshold value, and forming target clustering text subsets respectively corresponding to each text clustering result; and
acquiring a time period, time period video data and a target clustering text subset corresponding to each target time division segment, forming a mixed data set corresponding to each target time division segment, and sending the mixed data set to a target user side.
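By way of non-limiting illustration, the last two steps of claim 1 (keeping the clusters whose descending size rank does not exceed the preset ranking threshold, then bundling them with the time period and the time period video data) can be sketched as follows. The names `MixedDataSet`, `select_top_clusters`, `build_mixed_data_set` and the clip file name are hypothetical and do not appear in the source; representing each cluster as a list of its bullet screen texts is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MixedDataSet:
    """One bundle per target time segment (field names are illustrative)."""
    time_period: Tuple[float, float]  # (start, end) offsets in seconds
    video_clip: str                   # hypothetical reference to the segment's video data
    top_clusters: List[List[str]]     # question clusters kept after ranking


def select_top_clusters(clusters: Dict[int, List[str]],
                        rank_threshold: int) -> List[List[str]]:
    """Keep clusters whose size rank (descending) is within rank_threshold."""
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return ranked[:rank_threshold]


def build_mixed_data_set(time_period: Tuple[float, float],
                         video_clip: str,
                         clusters: Dict[int, List[str]],
                         rank_threshold: int) -> MixedDataSet:
    """Assemble the data sent to the target user side for one segment."""
    return MixedDataSet(time_period, video_clip,
                        select_top_clusters(clusters, rank_threshold))


# Toy clustering result for one target time segment (assumed data).
clusters = {
    0: ["What is X?", "What does X mean?", "Can you define X?"],
    1: ["Why Y?"],
    2: ["How does Z work?", "How is Z done?"],
}
bundle = build_mixed_data_set((30.0, 40.0), "clip_0030_0040.mp4",
                              clusters, rank_threshold=2)
```

With a ranking threshold of 2, only the two largest clusters are retained, so the single-question cluster is dropped before the bundle is sent.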
2. The bullet screen text-based problem set acquisition method according to claim 1, wherein, after the step of acquiring the time period, the time period video data and the target clustering text subset corresponding to each target time division segment, forming a mixed data set corresponding to each target time division segment, and sending the mixed data set to the target user side, the method further comprises:
receiving reply data which is sent by the target user side and respectively corresponds to each mixed data set, and adding the reply data of each mixed data set to the time period video data of the corresponding target time division segment to obtain the reply video data corresponding to the target video data.
3. The bullet screen text-based problem set acquisition method according to claim 1, wherein the text sentence type recognition model is a support vector machine classification model;
before the step of inputting each piece of bullet screen text data in the bullet screen text data set into a pre-trained text sentence type identification model to obtain a text sentence type corresponding to each piece of bullet screen text data, and acquiring the bullet screen text data whose text sentence type is a question sentence from the bullet screen text data set to form a target bullet screen text data set, the method further comprises the following steps:
acquiring a historical bullet screen text set as a sample set;
obtaining a sentence vector corresponding to each historical bullet screen text in the historical bullet screen text set;
acquiring a text sentence type label value corresponding to each historical bullet screen text, wherein the label value of a bullet screen text whose text sentence type is a question sentence is 1, and the label value of a bullet screen text whose text sentence type is a non-question sentence is 0;
and taking the sentence vector corresponding to each historical bullet screen text as the input of the classification model of the support vector machine to be trained, taking the label value corresponding to the sentence vector as the output of the classification model of the support vector machine to be trained, training the classification model of the support vector machine to be trained to obtain the classification model of the support vector machine, and acquiring the classification hyperplane corresponding to the classification model of the support vector machine.
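By way of non-limiting illustration, the training step of claim 3 can be sketched with a linear support vector machine fitted by sub-gradient descent on the hinge loss. The claim does not specify a kernel, a solver, or how sentence vectors are produced, so the linear form, the hyperparameters, the function names and the toy 2-D "sentence vectors" below are all assumptions. The returned pair `(w, b)` plays the role of the classification hyperplane.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(vectors: Sequence[Sequence[float]],
                     labels: Sequence[int],
                     epochs: int = 50,
                     lr: float = 0.1,
                     reg: float = 0.01) -> Tuple[List[float], float]:
    """Fit w, b so that w.x + b separates label 1 (question) from label 0."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    # Map the claim's 1/0 label values to the +1/-1 targets used by the hinge loss.
    signs = [1.0 if y == 1 else -1.0 for y in labels]
    for _ in range(epochs):
        for x, s in zip(vectors, signs):
            margin = s * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularisation step (shrinks w on every sample).
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                # Sample is inside the margin: move the hyperplane towards it.
                w = [wi + lr * s * xi for wi, xi in zip(w, x)]
                b += lr * s
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return 1 for a question sentence, 0 otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0


# Toy sample set: assumed 2-D sentence vectors with their label values.
history_vectors = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0),
                   (-2.0, -2.0), (-3.0, -2.5), (-2.5, -3.0)]
history_labels = [1, 1, 1, 0, 0, 0]
w, b = train_linear_svm(history_vectors, history_labels)
```

In practice a library implementation with a tuned kernel would normally replace this hand-written loop; the sketch only shows how the label values 1 and 0 and the hyperplane relate.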
4. The bullet screen text-based problem set acquisition method according to claim 1, wherein the step of counting, in ascending time order, the number of target bullet screen text data corresponding to each time segment according to the bullet screen sending time corresponding to each piece of target bullet screen text data in the target bullet screen text data set and a plurality of time segments corresponding to the target video data, to form a problem number time sequence, comprises:
dividing the target video time length corresponding to the target video data according to a preset time window value to obtain a time division segment set corresponding to the target video time length;
counting and acquiring a target bullet screen text data subset corresponding to each time partition in the time partition set according to bullet screen sending time corresponding to each target bullet screen text data in the target bullet screen text data set;
and counting the number of the bullet screens corresponding to each target bullet screen text data subset in sequence according to the time ascending sequence of the time partition corresponding to each target bullet screen text data subset to form a problem number time sequence.
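By way of non-limiting illustration, the three steps of claim 4 (dividing the video duration by a preset time window, assigning each question-type bullet screen to its segment, and counting per segment in ascending time order) can be sketched as follows. The function name, the use of seconds as the unit, and the treatment of a bullet screen sent exactly at the end of the video are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def question_count_series(send_times: Sequence[float],
                          video_duration: float,
                          window: float) -> List[int]:
    """Count question-type bullet screens per fixed-width time segment.

    send_times: offsets into the video (seconds) of each target bullet screen.
    window: the preset time window value (seconds).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    # Number of segments needed to cover the whole video.
    n_segments = max(1, math.ceil(video_duration / window))
    counts = Counter()
    for t in send_times:
        if 0 <= t <= video_duration:
            # A bullet screen sent exactly at the end falls into the last segment.
            idx = min(int(t // window), n_segments - 1)
            counts[idx] += 1
    # Emit counts in ascending time order, including empty segments.
    return [counts[i] for i in range(n_segments)]


series = question_count_series([5, 12, 14, 31, 33, 34, 59],
                               video_duration=60, window=10)
```

For this toy input the six 10-second segments receive 1, 2, 0, 3, 0 and 1 questions respectively; keeping the empty segments matters because the later inflection point detection works on evenly spaced values.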
5. The bullet screen text-based problem set acquisition method according to claim 1, wherein the step of performing inflection point detection on the problem quantity time series to obtain an inflection point detection result set comprises:
calling a prestored binary segmentation model, and acquiring a loss function corresponding to the binary segmentation model;
and obtaining, through the binary segmentation model, the target problem quantity values in the problem quantity time series at which the total loss over all the problem quantity values is minimized, to form the inflection point detection result set.
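By way of non-limiting illustration, binary segmentation can be sketched as repeatedly choosing the split index that most reduces the total loss of the resulting segments. The source does not state which loss function is used, so the squared deviation from the segment mean below is an assumption, as are the function names, the `max_points` stopping rule, and the rule that an inflection point is "rising" when the mean after it exceeds the mean before it.

```python
from typing import List, Optional, Sequence, Tuple


def l2_cost(values: Sequence[float]) -> float:
    """Assumed loss: sum of squared deviations from the segment mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(series: Sequence[float], start: int,
               end: int) -> Optional[Tuple[float, int]]:
    """Return (loss reduction, index) of the best split of series[start:end]."""
    base = l2_cost(series[start:end])
    best = None
    for k in range(start + 1, end):
        gain = base - (l2_cost(series[start:k]) + l2_cost(series[k:end]))
        if best is None or gain > best[0]:
            best = (gain, k)
    return best


def binary_segmentation(series: Sequence[float], max_points: int,
                        min_gain: float = 1e-9) -> List[int]:
    """Greedily split the segment whose best split lowers the total loss most."""
    segments = [(0, len(series))]
    points: List[int] = []
    while len(points) < max_points:
        candidates = []
        for s, e in segments:
            found = best_split(series, s, e)
            if found is not None and found[0] > min_gain:
                candidates.append((found[0], found[1], s, e))
        if not candidates:
            break
        _, k, s, e = max(candidates)
        segments.remove((s, e))
        segments += [(s, k), (k, e)]
        points.append(k)
    return sorted(points)


def rising_points(series: Sequence[float], points: Sequence[int]) -> List[int]:
    """Keep inflection points after which the mean level goes up."""
    bounds = [0] + list(points) + [len(series)]
    rising = []
    for i in range(1, len(bounds) - 1):
        left = series[bounds[i - 1]:bounds[i]]
        right = series[bounds[i]:bounds[i + 1]]
        if sum(right) / len(right) > sum(left) / len(left):
            rising.append(bounds[i])
    return rising


counts = [1, 1, 1, 9, 9, 9, 2, 2, 2]
points = binary_segmentation(counts, max_points=2)
```

On this toy series the detected indices are 3 and 6, and only index 3 is a rising inflection point, since the question count jumps from 1 to 9 there and falls back at index 6.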
6. The method for acquiring the bullet screen text-based problem set according to claim 1, wherein the step of performing text clustering on the target bullet screen text data subsets corresponding to each target time partition in the target time partition set to obtain a text clustering result corresponding to each target time partition comprises:
acquiring a target bullet screen text included in an ith group of target bullet screen text data subsets in an ith target time division segment; wherein the initial value of i is 1, the value range of i is [1, k ], and the value of k is equal to the total number of the target time division sections included in the target time division section set;
obtaining sentence vectors respectively corresponding to target bullet screen texts included in the ith group of target bullet screen text data subsets;
clustering each sentence vector corresponding to the ith group of target bullet screen text data subsets according to a pre-trained Gaussian mixture model to obtain an ith group of text clustering results corresponding to the ith group of target bullet screen text data subsets;
increasing the value of i by 1 to update the value of i, and determining whether i exceeds k; if i does not exceed k, returning to the step of acquiring the target bullet screen text included in the ith group of target bullet screen text data subsets in the ith target time division segment;
if i exceeds k, the process ends.
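By way of non-limiting illustration, the loop of claim 6 can be sketched as follows: for i from 1 to k, each sentence vector of the ith subset is assigned to the component of an already trained Gaussian mixture model with the highest posterior. Training is out of scope here, so the model parameters are supplied directly; the diagonal covariance form, the function names, and the toy 2-D vectors are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (weights, means, per-dimension variances) of a trained mixture model.
Gmm = Tuple[Sequence[float], Sequence[Sequence[float]], Sequence[Sequence[float]]]


def gmm_assign(vector: Sequence[float], gmm: Gmm) -> int:
    """Index of the mixture component with the highest posterior for vector."""
    weights, means, variances = gmm
    best_label, best_score = 0, -math.inf
    for label, (w, mu, var) in enumerate(zip(weights, means, variances)):
        # Log of weight times diagonal Gaussian density (log space avoids underflow).
        log_p = math.log(w)
        for x, m, v in zip(vector, mu, var):
            log_p -= 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        if log_p > best_score:
            best_label, best_score = label, log_p
    return best_label


def cluster_segments(segment_vectors: Sequence[Sequence[Sequence[float]]],
                     gmm: Gmm) -> List[Dict[int, List[int]]]:
    """Cluster each target time segment's sentence vectors in turn."""
    k = len(segment_vectors)
    results: List[Dict[int, List[int]]] = []
    i = 1                                   # initial value of i is 1
    while i <= k:                           # stop once i exceeds k
        clusters: Dict[int, List[int]] = {}
        for idx, vec in enumerate(segment_vectors[i - 1]):
            clusters.setdefault(gmm_assign(vec, gmm), []).append(idx)
        results.append(clusters)            # ith text clustering result
        i += 1
    return results


toy_gmm = ([0.5, 0.5], [[0.0, 0.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])
segments = [
    [[0.1, -0.2], [4.8, 5.1], [0.3, 0.2]],  # sentence vectors of segment 1
    [[5.2, 4.9]],                           # sentence vectors of segment 2
]
results = cluster_segments(segments, toy_gmm)
```

Each result maps a cluster label to the positions of the bullet screen texts in that segment's subset, which is enough to rank clusters by size afterwards.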
7. The bullet screen text-based problem set acquisition method according to claim 2, wherein the step of adding the reply data of each mixed data set to the time period video data of the corresponding target time division segment to obtain the reply video data corresponding to the target video data comprises:
and adding the original text data or hyperlink address of the reply data of each mixed data set to the time period video data of the corresponding target time division segment to obtain the corresponding reply video data.
8. A bullet screen text-based problem set acquisition device, characterized by comprising:
the bullet screen data set acquisition unit is used for acquiring a bullet screen text data set of the selected target video data in the last bullet screen acquisition period;
a target bullet screen text acquisition unit, configured to input each piece of bullet screen text data in the bullet screen text data set to a pre-trained text sentence type identification model, obtain a text sentence type corresponding to each piece of bullet screen text data, and acquire the bullet screen text data whose text sentence type is a question sentence from the bullet screen text data set, to form a target bullet screen text data set;
a problem quantity time sequence obtaining unit, configured to count, according to the bullet screen sending time corresponding to each piece of target bullet screen text data in the target bullet screen text data set and a plurality of time segments corresponding to the target video data, the quantity of target bullet screen text data corresponding to each time segment in an ascending time order, and form a problem quantity time sequence;
the inflection point detection unit is used for carrying out inflection point detection on the problem quantity time sequence to obtain an inflection point detection result set;
a target time partition set obtaining unit, configured to obtain rising inflection points in the inflection point detection result set and a time partition corresponding to each rising inflection point, and combine the time partitions corresponding to each rising inflection point to obtain a target time partition set;
the text clustering result acquisition unit is used for respectively carrying out text clustering on the target bullet screen text data subsets corresponding to each target time division segment in the target time division segment set to obtain a text clustering result corresponding to each target time division segment;
the target clustering text subset acquisition unit is used for acquiring clustering texts of which the descending ranking of the text clustering quantity in each text clustering result does not exceed the corresponding ranking threshold value, and forming target clustering text subsets respectively corresponding to each text clustering result; and
and the mixed data set acquisition unit is used for acquiring the time period, the time period video data and the target clustering text subset corresponding to each target time division segment, forming a mixed data set corresponding to each target time division segment, and sending the mixed data set to the target user side.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the bullet screen text based problem set acquisition method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the bullet screen text-based problem set acquisition method according to any one of claims 1 to 7.
CN202110430212.3A 2021-04-21 2021-04-21 Bullet screen text-based problem set acquisition method and device and computer equipment Active CN112995719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110430212.3A CN112995719B (en) 2021-04-21 2021-04-21 Bullet screen text-based problem set acquisition method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN112995719A true CN112995719A (en) 2021-06-18
CN112995719B CN112995719B (en) 2021-07-27

Family

ID=76341512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110430212.3A Active CN112995719B (en) 2021-04-21 2021-04-21 Bullet screen text-based problem set acquisition method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN112995719B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781853A (en) * 2021-08-23 2021-12-10 安徽教育出版社 Teacher-student remote interactive education platform based on terminal
CN115348479A (en) * 2022-07-22 2022-11-15 北京奇艺世纪科技有限公司 Video playing problem identification method, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07107410A (en) * 1993-09-30 1995-04-21 Sanyo Electric Co Ltd Television receiver with slave screen display function
CN104967876A (en) * 2014-09-30 2015-10-07 腾讯科技(深圳)有限公司 Pop-up information processing method and apparatus, and pop-up information display method and apparatus
CN106028176A (en) * 2016-05-31 2016-10-12 北京奇艺世纪科技有限公司 Method and device for determining content explosion point in streaming media
CN107656948A (en) * 2016-11-14 2018-02-02 平安科技(深圳)有限公司 The problem of in automatically request-answering system clustering processing method and device
US20180191987A1 (en) * 2017-01-04 2018-07-05 International Business Machines Corporation Barrage message processing
CN110427897A (en) * 2019-08-07 2019-11-08 北京奇艺世纪科技有限公司 Analysis method, device and the server of video highlight degree
CN112201099A (en) * 2019-07-08 2021-01-08 苏州易学在线文化传播有限公司 Online education platform
CN112672202A (en) * 2020-12-28 2021-04-16 广州博冠信息科技有限公司 Bullet screen processing method, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QUAN MIAO: "Research on information security mechanism for barrage videos", 《2016 IEEE ADVANCED INFORMATION MANAGEMENT, COMMUNICATES, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (IMCEC)》 *
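Hong Qing: "Classification of video user groups based on bullet screen sentiment analysis and clustering algorithms", 《Computer Engineering & Science》 *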

Also Published As

Publication number Publication date
CN112995719B (en) 2021-07-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant