CN113761986A - Text acquisition method, live broadcast method, device, and storage medium


Info

Publication number
CN113761986A
CN113761986A (application number CN202010506291.7A)
Authority
CN
China
Prior art keywords
question
text
target user
speaking
stream
Prior art date
Legal status
Pending
Application number
CN202010506291.7A
Other languages
Chinese (zh)
Inventor
曹雅婷
胡琨
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010506291.7A
Publication of CN113761986A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21 Server components or server architectures
    • H04N 21/218 Source of audio or video content, e.g. local disk arrays
    • H04N 21/2187 Live feed
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 Processing of audio elementary streams
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/2343 Reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/234336 Reformatting by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text

Abstract

Embodiments of the present application provide a text acquisition method, a live broadcast method, a device, and a storage medium. In some embodiments of the present application, a text acquisition device first converts a video stream containing a target user's question-and-answer content into an image stream and an audio stream; it then identifies the target user's speaking intervals in the image stream and in the audio stream separately; next, it fuses the speaking intervals identified in the two streams to determine a target speaking interval and intercepts the target audio segment corresponding to that interval from the audio stream, so that combining the image stream with the audio stream improves the recognition rate of the audio segments carrying the target user's speech; finally, it converts the target audio segment containing the target user's speech into text to obtain the target user's spoken text content, giving high accuracy when extracting text content from a video stream.

Description

Text acquisition method, live broadcast method, device, and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a text acquisition method, a live broadcast method, a device, and a storage medium.
Background
Currently, many application scenarios call for extracting question-and-answer text. For example, in the question-and-answer session of guest interviews at a large conference, the question-and-answer content can be distilled for later use in articles or press releases; in e-commerce live streaming, a host's answers to product questions can be retained by the platform and used to reply to user questions automatically in the future.
At present, however, extracting question-and-answer text from video or audio is inefficient.
Disclosure of Invention
Aspects of the present application provide a text acquisition method, a live broadcast method, a device, and a storage medium, which improve the efficiency of extracting question-and-answer text.
The embodiment of the application provides a text acquisition method, which comprises the following steps:
converting a video stream containing the question and answer content of the target user into an image stream and an audio stream;
respectively identifying corresponding speaking intervals of a target user in the image stream and the audio stream;
intercepting a target audio clip containing the speaking content of the target user from the audio stream according to the corresponding speaking intervals of the identified target user in the image stream and the audio stream;
and performing text conversion on the target audio clip containing the speaking content of the target user to obtain the speaking text content of the target user.
An embodiment of the present application further provides an information processing apparatus, including: one or more processors and one or more memories storing computer programs;
the one or more processors being configured to execute the computer programs to:
converting a video stream containing the question and answer content of the target user into an image stream and an audio stream;
respectively identifying corresponding speaking intervals of a target user in the image stream and the audio stream;
intercepting a target audio clip containing the speaking content of the target user from the audio stream according to the corresponding speaking intervals of the identified target user in the image stream and the audio stream;
and performing text conversion on the target audio clip containing the speaking content of the target user to obtain the speaking text content of the target user.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program that, when executed by one or more processors, causes the one or more processors to perform actions comprising:
converting a video stream containing the question and answer content of the target user into an image stream and an audio stream;
respectively identifying corresponding speaking intervals of a target user in the image stream and the audio stream;
intercepting a target audio clip containing the speaking content of the target user from the audio stream according to the corresponding speaking intervals of the identified target user in the image stream and the audio stream;
and performing text conversion on the target audio clip containing the speaking content of the target user to obtain the speaking text content of the target user.
The embodiment of the application further provides a live broadcasting method, which includes:
receiving a video stream which is sent by an intelligent terminal and contains the question and answer content of a target user;
respectively identifying corresponding speaking intervals of a target user in an image stream and an audio stream in a video stream;
obtaining the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream;
inputting the text content into a pre-trained question-answer marking model to perform question-answer marking on the text content and obtain a question-answer text;
receiving a target question sent by the intelligent terminal, and selecting a target answer corresponding to the target question from the question-answer text;
and sending the target answer to the intelligent terminal so that the user can view the answer.
An embodiment of the present application further provides a server, including: one or more processors and one or more memories storing computer programs;
the one or more processors being configured to execute the computer programs to:
receiving a video stream which is sent by an intelligent terminal and contains the question and answer content of a target user;
respectively identifying corresponding speaking intervals of a target user in an image stream and an audio stream in a video stream;
obtaining the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream;
inputting the text content into a pre-trained question-answer marking model to perform question-answer marking on the text content and obtain a question-answer text;
receiving a target question sent by the intelligent terminal, and selecting a target answer corresponding to the target question from the question-answer text;
and sending the target answer to the intelligent terminal so that the user can view the answer.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program that, when executed by one or more processors, causes the one or more processors to perform actions comprising:
receiving a video stream which is sent by an intelligent terminal and contains the question and answer content of a target user;
respectively identifying corresponding speaking intervals of a target user in an image stream and an audio stream in a video stream;
obtaining the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream;
inputting the text content into a pre-trained question-answer marking model to perform question-answer marking on the text content and obtain a question-answer text;
receiving a target question sent by the intelligent terminal, and selecting a target answer corresponding to the target question from the question-answer text;
and sending the target answer to the intelligent terminal so that the user can view the answer.
In some embodiments of the present application, a text acquisition device first converts a video stream containing a target user's question-and-answer content into an image stream and an audio stream; it then identifies the target user's speaking intervals in the image stream and in the audio stream separately; next, it fuses the speaking intervals identified in the two streams to determine a target speaking interval and intercepts the target audio segment corresponding to that interval from the audio stream, so that combining the image stream with the audio stream improves the recognition rate of the audio segments carrying the target user's speech; finally, it converts the target audio segment containing the target user's speech into text to obtain the target user's spoken text content, giving high accuracy when extracting text content from a video stream.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1a is a schematic structural diagram of a text acquisition system according to an exemplary embodiment of the present application;
fig. 1b is a schematic diagram of another text acquisition method provided in an exemplary embodiment of the present application;
fig. 2a is a flowchart of a method of obtaining a text according to an exemplary embodiment of the present application;
fig. 2b is a flowchart of a method of obtaining a text according to an exemplary embodiment of the present application;
fig. 3a is a schematic flowchart of a news playing method according to an exemplary embodiment of the present application;
fig. 3b is a schematic flowchart of a live broadcast method according to an exemplary embodiment of the present application;
fig. 4 is a schematic structural diagram of a text acquisition apparatus according to an exemplary embodiment of the present application;
fig. 5 is a schematic structural diagram of a text acquisition device according to an exemplary embodiment of the present application;
fig. 6 is a schematic structural diagram of a video capture device according to an exemplary embodiment of the present application;
fig. 7 is a schematic structural diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Currently, many application scenarios call for extracting question-and-answer text. For example, in the question-and-answer session of guest interviews at a large conference, the question-and-answer content can be distilled for later use in articles or press releases; in e-commerce live streaming, a host's answers to product questions can be retained by the platform and used to reply to user questions automatically in the future. At present, however, extracting question-and-answer text from video or audio is inefficient and its accuracy is low.
In some embodiments of the present application, a text acquisition device first converts a video stream containing a target user's question-and-answer content into an image stream and an audio stream; it then identifies the target user's speaking intervals in the image stream and in the audio stream separately; next, it fuses the speaking intervals identified in the two streams to determine a target speaking interval and intercepts the target audio segment corresponding to that interval from the audio stream, so that combining the image stream with the audio stream improves the recognition rate of the audio segments carrying the target user's speech; finally, it converts the target audio segment containing the target user's speech into text to obtain the target user's spoken text content, giving high accuracy when extracting text content from a video stream.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1a is a schematic structural diagram of a text acquisition system 10 according to an exemplary embodiment of the present application. As shown in fig. 1a, the text acquisition system 10 includes a video recording device 10a and a server 10b.
In this embodiment, the video recording device 10a includes an image sensor and an audio sensor for capturing a video stream containing the target user's question-and-answer content. The embodiment of the present application does not limit the type of the video recording device 10a. The video recording device 10a may be a camera, a robot, a personal computer, a smart television, a smart sound box, or another smart device. In addition to its basic service functions, the video recording device 10a may have computing, communication, internet access, and other capabilities.
There may be one or more video recording devices 10a. A video recording device 10a typically includes at least one processing unit, at least one memory, a display, an image sensor, and an audio sensor. The number of processing units and memories depends on the configuration and type of the video recording device 10a.
The display may include a screen, primarily for displaying various types of information. Alternatively, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or slide action but also detect information such as duration and pressure related to the touch or slide operation.
The memory may include volatile memory, such as RAM, non-volatile memory, such as read-only memory (ROM) or flash memory, or both. The memory typically stores an operating system (OS) and one or more application programs, and may also store program data and the like.
In addition to the processing unit, memory, and display, video recording device 10a includes some basic configurations, such as a network card chip, an IO bus, audio and video components, and so on. Optionally, video recording device 10a may also include peripheral devices such as a keyboard, mouse, stylus, printer, etc. These peripheral devices are well known in the art and will not be described in detail herein.
In this embodiment, the video recording device 10a mainly acquires a video stream containing the target user's question-and-answer content and sends it to the server 10b, which extracts the text content spoken by the target user from the video stream. The server 10b may further perform question-answer marking on the text content, and the marked text can be used in articles or press releases.
In the present embodiment, a video recognition model, an audio clip determination model, and a question-answer marking model are deployed on the server 10b. The embodiment of the present application does not limit the implementation form of the server 10b; for example, it may be a conventional server, a cloud host, a virtual center, or another server device. The server device mainly includes a processor, a hard disk, a memory, a system bus, and the like, similar to a general-purpose computer architecture. The server 10b may include one web server or a plurality of web servers.
The video recording device 10a and the server 10b are connected through a wireless or wired network. In this embodiment, if the video recording device 10a is communicatively connected to the server 10b through a mobile network, the network format of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, and the like.
In a news conference scenario, the video recording device 10a records video during the conference. In practice, the camera should face the speaker as much as possible during recording to improve the recognition and matching rate. After recording finishes, the video recording device 10a may clip the video and then send the video stream containing the target user's question-and-answer content to the server 10b. The server 10b converts that video stream into an image stream and an audio stream, identifies the target user's speaking intervals in each stream separately, fuses the identified intervals to determine a target speaking interval, and intercepts the target audio segment corresponding to that interval from the audio stream; combining the image stream with the audio stream improves the recognition rate of the audio segments carrying the target user's speech. The server 10b then converts the target audio segment containing the target user's speech into text, obtaining the target user's spoken text content with high accuracy. In addition, the server 10b may input the text content into a pre-trained question-answer marking model to mark questions and answers and obtain a question-answer text. The application does not limit the use of the question-answer text; it can be used in subsequent articles or press releases.
In a live shopping scenario, the host can record the live shopping video with his or her own video recording device 10a during the broadcast. After recording finishes, the video recording device 10a may clip the video and then send the video stream containing the target user's question-and-answer content to the server 10b. The server 10b converts that video stream into an image stream and an audio stream, identifies the target user's speaking intervals in each stream separately, fuses the identified intervals to determine a target speaking interval, and intercepts the target audio segment corresponding to that interval from the audio stream; combining the image stream with the audio stream improves the recognition rate of the audio segments carrying the target user's speech. The server 10b then converts the target audio segment containing the target user's speech into text, obtaining the target user's spoken text content with high accuracy. In addition, the server 10b may input the text content into a pre-trained question-answer marking model to mark questions and answers and obtain a question-answer text. The application does not limit the use of the question-answer text; it can be retained by the platform and used to reply to user questions automatically in the future.
In the present embodiment, after receiving the video stream containing the target user's question-and-answer content, the server 10b converts it into an image stream and an audio stream.
The server 10b then identifies the target user's speaking intervals in the image stream and the audio stream separately. Separating the image stream and the audio stream from the video stream allows the server 10b to determine a target speaking interval from the speaking intervals identified in each stream and to intercept the target audio segment corresponding to the target speaking interval from the audio stream.
It should be noted that the present application does not limit the number of target users: there may be one or more, adjusted according to the actual application scenario. For example, in a news conference scenario, multiple target users may speak; in a live shopping scenario, a single target user may speak.
In the present embodiment, the server 10b identifies the target user's speaking intervals in the image stream and the audio stream. To identify the target user's speaking interval in the image stream, one realizable way is to use the target user's image features. Optionally, the image stream and the target user's image features are input into a video recognition model; within the model, image stream segments containing the target user's image features are extracted according to those features, and a convolutional neural network algorithm identifies the target user's speaking intervals in the image stream from those segments.
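To make the interval notion concrete, here is a minimal sketch that converts per-frame "target user is speaking" scores into speaking intervals in seconds. The scores are assumed to come from the video recognition model described above (which is not shown); the 25 fps rate and 0.5 threshold are illustrative assumptions.

```python
def speaking_intervals_from_frames(frame_scores, fps=25.0, threshold=0.5):
    """Turn per-frame speaking scores into (start, end) intervals in seconds.

    `frame_scores` is assumed to be the video recognition model's per-frame
    output (face identity plus mouth movement); the model itself is assumed.
    """
    intervals, start = [], None
    for i, score in enumerate(frame_scores):
        speaking = score >= threshold
        if speaking and start is None:
            start = i / fps          # interval opens on the first speaking frame
        elif not speaking and start is not None:
            intervals.append((start, i / fps))  # interval closes
            start = None
    if start is not None:            # close a trailing open interval
        intervals.append((start, len(frame_scores) / fps))
    return intervals
```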
Before the image stream and the target user's image features are input into the video recognition model, the model is trained using image stream samples, image feature samples of users, and samples of the users' speaking intervals in the image streams. Optionally, image stream segments containing a user's image feature sample are extracted from the image stream samples; these segments, together with the user's speaking interval samples, are fed to a convolutional neural network algorithm, which establishes a mapping between image stream segments containing a user's image features and the user's speaking intervals in the image stream, yielding the video recognition model.
In such embodiments, image stream samples from a variety of scenes are covered as much as possible when training samples are collected, to improve sample coverage. The users' actual speaking intervals in the image streams are marked, producing labeled training samples.
The marked training samples can then be input into a neural network model. Inside the model, image stream segments containing a user's image feature sample are extracted from the image stream samples according to the model parameters, and the convolutional neural network algorithm establishes the mapping between those segments and the user's speaking intervals in the image stream. The model's loss function layer computes a loss from the difference between the speaking intervals output by the output layer and the users' actual speaking intervals in the image stream. When the loss function meets the set requirement, the trained video recognition model is obtained.
Within the video recognition model, the image stream may be preprocessed and filtered, for example using pre-labeled questioning actions, guest answers, and recognition of on-site large-screen text.
The server 10b also identifies the target user's speaking interval in the audio stream; one realizable way is to use the target user's voiceprint features. Optionally, the audio stream and the target user's voiceprint features are input into an audio recognition model; within the model, audio segments containing the target user's voiceprint features are extracted, and the target user's speaking intervals in the audio stream are identified from those segments.
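A minimal sketch of the audio side, assuming the audio recognition model reduces to comparing windowed speaker embeddings against the target user's enrolled voiceprint by cosine similarity. The embedding model, window length, and 0.7 threshold are all illustrative assumptions, not details from the patent.

```python
import numpy as np

def speaking_intervals_from_audio(window_embeddings, target_voiceprint,
                                  win_sec=1.0, threshold=0.7):
    """Score fixed-length audio windows against the target user's voiceprint.

    `window_embeddings` is a list of embedding vectors, one per `win_sec`
    of audio; the embedding model standing in for the patent's audio
    recognition model is assumed and not shown here.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    intervals, start = [], None
    for i, emb in enumerate(window_embeddings):
        t = i * win_sec
        if cosine(emb, target_voiceprint) >= threshold:
            if start is None:
                start = t                     # target user's voice detected
        elif start is not None:
            intervals.append((start, t))      # voice no longer matches
            start = None
    if start is not None:
        intervals.append((start, len(window_embeddings) * win_sec))
    return intervals
```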
Before inputting the audio stream and the target user's voiceprint features into the audio recognition model, the server 10b trains the model using audio stream samples, users' voiceprint feature samples, and samples of the users' speaking intervals in the audio streams.
In such embodiments, audio stream samples from a variety of scenarios are covered as much as possible when training samples are collected, to improve sample coverage. The users' actual speaking intervals in the audio streams are marked, producing labeled training samples.
The marked training samples can then be input into a neural network model. Inside the model, audio stream segments containing a user's voiceprint feature sample are extracted from the audio stream samples according to the model parameters, and a convolutional neural network algorithm establishes the mapping between those segments and the user's speaking intervals in the audio stream. The model's loss function layer computes a loss from the difference between the speaking intervals output by the output layer and the user's actual speaking intervals in the audio stream. When the loss function meets the set requirement, the trained audio recognition model is obtained.
In this embodiment, a target audio segment containing the speaking content of the target user is intercepted from the audio stream according to the recognized corresponding speaking interval of the target user in the image stream and the audio stream.
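A minimal way to combine the two modalities, under the simplifying assumption that fusion reduces to interval intersection (the patent instead trains an audio clip determination model for this step, described next), followed by an illustrative ffmpeg cut of each fused interval:

```python
import subprocess

def fuse_intervals(image_ivs, audio_ivs):
    """Intersect image-stream and audio-stream speaking intervals.

    A simplifying assumption; the patent maps the two interval sets to
    target segments with a trained audio clip determination model.
    """
    fused = []
    for a_start, a_end in image_ivs:
        for b_start, b_end in audio_ivs:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:                  # keep only genuine overlaps
                fused.append((start, end))
    return fused

def cut_segment(audio_path, start, end, out_path):
    """Intercept one target audio segment with ffmpeg (illustrative)."""
    subprocess.run(
        ["ffmpeg", "-i", audio_path, "-ss", str(start), "-to", str(end),
         "-c", "copy", out_path],
        check=True,
    )
```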
Optionally, a model is trained using users' corresponding speaking intervals in the image stream and the audio stream together with the target audio segments containing the users' speech, establishing a mapping between the interval pairs and the target audio segments; this yields the audio clip determination model.
In such an embodiment, corresponding speaking intervals from image streams and audio streams in a variety of scenes are covered as much as possible when training samples are collected, to improve sample coverage. The actual audio segments of the target user's speech are marked, producing labeled training samples.
The marked training samples can then be input into a neural network model. Inside the model, audio segments containing the user's speech are determined from the user's corresponding speaking intervals in the image stream and the audio stream according to the model parameters, and the mapping between the interval pairs and the target audio segments is established. The model's loss function layer computes a loss from the difference between the predicted target audio segments and the actual audio segments of the target user's speech. When the loss function meets the set requirement, the trained audio clip determination model is obtained.
In this embodiment, the server 10b performs text conversion on the target audio segment containing the target user's speech, using a text recognition algorithm, to obtain the target user's spoken text content.
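For the text conversion step, a sketch using the open-source Whisper library as a stand-in for the unspecified text recognition algorithm; the library choice and model size are assumptions, not the patent's method.

```python
import whisper  # pip install openai-whisper; a stand-in ASR, not the patent's algorithm

def transcribe_segments(segment_paths, model_name="base"):
    """Convert each intercepted target audio segment to text."""
    model = whisper.load_model(model_name)
    return [model.transcribe(path)["text"].strip() for path in segment_paths]
```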
In this embodiment, the server 10b inputs the text content into the pre-trained question-answer marking model to perform question-answer marking and obtain a question-answer text.
In this embodiment, before the text content is input into the question-answer marking model, the model is trained using text content samples and the question-answer texts obtained by marking those samples, establishing a mapping between text content samples and question-answer text samples; this yields the question-answer marking model.
In such an embodiment, text content and corresponding question-answer text from a variety of scenes are covered as much as possible when training samples are collected, to improve sample coverage. The question-answer texts are labeled, producing marked question-answer texts.
The marked training samples can then be input into a neural network model, which performs question-answer marking on the text content samples according to the model parameters. The model's loss function layer computes a loss from the difference between the marked question-answer text output by the output layer and the actual question-answer text. When the loss function meets the set requirement, the trained question-answer marking model is obtained.
In some embodiments, users may submit question text segments in advance. For example, in a conference scenario, a venue questioning tool collects audience questions; a question may arrive directly as a text segment, or a question asked by voice may be converted into text to obtain a question text segment. After performing question-answer marking on the text content, the server 10b generates candidate question text segments; it can then compute the similarity between each generated candidate and the collected question text segments to decide which should serve as the question text in the question-answer text. One realizable way is that the server 10b determines whether any preset question text segment has a similarity to the candidate question text segment greater than a set similarity threshold; if so, that target question text segment is used as the question text in the question-answer text; if not, the candidate question text segment is used.
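The selection rule in this paragraph can be expressed directly. In the sketch below, `sim` is any sentence-similarity function (for example, cosine similarity over sentence embeddings) and the 0.8 threshold is an illustrative assumption; neither is specified by the patent.

```python
def choose_question_text(candidate, preset_questions, sim, threshold=0.8):
    """Pick the question text per the similarity rule described above.

    `sim(a, b)` is a hypothetical sentence-similarity function returning a
    score in [0, 1]; `threshold` is an illustrative value.
    """
    best = max(preset_questions, key=lambda q: sim(candidate, q), default=None)
    if best is not None and sim(candidate, best) >= threshold:
        return best       # reuse the pre-collected audience question
    return candidate      # fall back to the model-generated question
```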
Fig. 1b is a schematic diagram of another text acquisition method according to an exemplary embodiment of the present application. As shown in fig. 1b, after acquiring the video stream containing the target user's question-and-answer content, the text acquisition device converts it into an image stream and an audio stream; it then identifies the target user's speaking intervals in the image stream and the audio stream separately; next, it fuses the identified intervals to determine the target speaking interval and intercepts the target audio clip corresponding to that interval from the audio stream, combining the image stream and the audio stream to improve the recognition rate of the audio clips carrying the target user's speech; it then converts the target audio clip into text, obtaining the target user's spoken text content with high accuracy; finally, it performs question-answer marking on the obtained text content to form a question-answer text, which can be used in subsequent articles or press releases, or to answer user questions automatically.
In the system embodiment of the present application, the text acquisition device first converts a video stream containing a target user's question-and-answer content into an image stream and an audio stream; it then identifies the target user's speaking intervals in the image stream and in the audio stream separately; next, it fuses the speaking intervals identified in the two streams to determine a target speaking interval and intercepts the target audio segment corresponding to that interval from the audio stream, so that combining the image stream with the audio stream improves the recognition rate of the audio segments carrying the target user's speech; finally, it converts the target audio segment containing the target user's speech into text to obtain the target user's spoken text content, giving high accuracy when extracting text content from a video stream.
In addition to the text acquisition system provided above, some embodiments of the present application also provide a text acquisition method, and the text acquisition method provided in the present application may be applied to the text acquisition system described above, but is not limited to the text acquisition system provided in the above embodiments.
Fig. 2a is a flowchart of a method of obtaining a text according to an exemplary embodiment of the present application. As shown in fig. 2a, the method comprises the following steps:
S211: converting a video stream containing the question and answer content of the target user into an image stream and an audio stream;
S212: respectively identifying corresponding speaking intervals of a target user in the image stream and the audio stream;
S213: intercepting a target audio clip containing the speaking content of the target user from the audio stream according to the corresponding speaking intervals of the identified target user in the image stream and the audio stream;
S214: performing text conversion on the target audio clip containing the speaking content of the target user to obtain the speaking text content of the target user.
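Putting steps S211 to S214 together, a hedged end-to-end sketch. All helper functions are the illustrative ones sketched in the system embodiment above, not APIs defined by the patent; `score_frames` and `embed_windows` are hypothetical callables standing in for the trained video and audio recognition models, and `voiceprint` is the target user's enrolled voiceprint embedding.

```python
def extract_spoken_text(video_path, score_frames, embed_windows, voiceprint):
    """Pipeline sketch for steps S211-S214, built from the earlier sketches.

    `score_frames` and `embed_windows` are hypothetical stand-ins for the
    trained recognition models; file names are illustrative.
    """
    # S211: convert the video stream into an image stream and an audio stream.
    split_video(video_path, "frames", "audio.wav")
    # S212: identify the target user's speaking intervals in each stream.
    image_ivs = speaking_intervals_from_frames(score_frames("frames"))
    audio_ivs = speaking_intervals_from_audio(embed_windows("audio.wav"), voiceprint)
    # S213: fuse the intervals and intercept the target audio segments.
    segments = []
    for i, (start, end) in enumerate(fuse_intervals(image_ivs, audio_ivs)):
        out = f"segment_{i}.wav"
        cut_segment("audio.wav", start, end, out)
        segments.append(out)
    # S214: convert the target audio segments to spoken text content.
    return transcribe_segments(segments)
```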
In this embodiment, the execution subject of the text acquisition method is a text acquisition device, which may be a computer device or a server. A video recognition model, an audio clip determination model, and a question-answer marking model are deployed on the text acquisition device. When the text acquisition device is a server, the embodiment of the present application does not limit its implementation form; for example, it may be a conventional server, a cloud host, a virtual center, or another server device. The server device mainly includes a processor, a hard disk, a memory, a system bus, and the like, similar to a general-purpose computer architecture. The server may include one web server or a plurality of web servers.
In this embodiment, video is recorded in advance using a video recording device, which includes an image sensor and an audio sensor for capturing a video stream containing the target user's question-and-answer content. The embodiment of the present application does not limit the type of the video recording device; it may be a camera, a robot, a personal computer, a smart television, a smart sound box, or another smart device. In addition to its basic service functions, the video recording device may have computing, communication, internet access, and other capabilities.
There may be one or more video recording devices. A video recording device typically includes at least one processing unit, at least one memory, a display, an image sensor, and an audio sensor; the number of processing units and memories depends on the device's configuration and type.
In this embodiment, the video recording device mainly acquires a video stream containing the target user's question-and-answer content and sends it to the text acquisition device, which extracts the text content spoken by the target user from the video stream. The text acquisition device can further perform question-answer marking on the text content, and the marked text can be used in articles or press releases.
In a news conference scenario, the video recording device records video during the conference. In practice, the camera should face the speaker as much as possible during recording to improve the recognition and matching rate. After recording finishes, the video recording device may clip the video and then send the video stream containing the target user's question-and-answer content to the text acquisition device. The text acquisition device converts that video stream into an image stream and an audio stream, identifies the target user's speaking intervals in each stream separately, fuses the identified intervals to determine the target speaking interval, and intercepts the target audio segment corresponding to that interval from the audio stream; combining the image stream with the audio stream improves the recognition rate of the audio segments carrying the target user's speech. The text acquisition device then converts the target audio segment containing the target user's speech into text, obtaining the target user's spoken text content with high accuracy. In addition, the text acquisition device can input the text content into a pre-trained question-answer marking model to mark questions and answers and obtain a question-answer text. The application does not limit the use of the question-answer text; it can be used in subsequent articles or press releases.
In a live shopping scenario, the host can record the live shopping video with his or her own video recording device during the broadcast. After recording finishes, the video recording device may clip the video and then send the video stream containing the target user's question-and-answer content to the text acquisition device. The text acquisition device converts that video stream into an image stream and an audio stream, identifies the target user's speaking intervals in each stream separately, fuses the identified intervals to determine the target speaking interval, and intercepts the target audio segment corresponding to that interval from the audio stream; combining the image stream with the audio stream improves the recognition rate of the audio segments carrying the target user's speech. The text acquisition device then converts the target audio segment containing the target user's speech into text, obtaining the target user's spoken text content with high accuracy. In addition, the text acquisition device can input the text content into a pre-trained question-answer marking model to mark questions and answers and obtain a question-answer text. The application does not limit the use of the question-answer text; it can be retained by the platform and used to reply to user questions automatically in the future.
In this embodiment, after receiving the video stream containing the target user's question-and-answer content, the text acquisition device converts it into an image stream and an audio stream.
The text acquisition device then identifies the target user's speaking intervals in the image stream and the audio stream separately. Separating the image stream and the audio stream from the video stream allows the device to determine a target speaking interval from the speaking intervals identified in each stream and to intercept the target audio clip corresponding to the target speaking interval from the audio stream.
It should be noted that the present application does not limit the number of target users: there may be one or more, adjusted according to the actual application scenario. For example, in a news conference scenario, multiple target users may speak; in a live shopping scenario, a single target user may speak.
In this embodiment, the text acquisition device identifies the target user's corresponding speaking intervals in the image stream and the audio stream. To identify the target user's speaking interval in the image stream, one realizable way is to use the target user's image features. Optionally, the image stream and the target user's image features are input into a video recognition model; within the model, image stream segments containing the target user's image features are extracted according to those features, and a convolutional neural network algorithm identifies the target user's speaking intervals in the image stream from those segments.
Before the image stream and the target user's image features are input into the video recognition model, the model is trained using image stream samples, image feature samples of users, and samples of the users' speaking intervals in the image streams. Optionally, image stream segments containing a user's image feature sample are extracted from the image stream samples; these segments, together with the user's speaking interval samples, are fed to a convolutional neural network algorithm, which establishes a mapping between image stream segments containing a user's image features and the user's speaking intervals in the image stream, yielding the video recognition model.
In such embodiments, image stream samples from a variety of scenes are covered as much as possible when training samples are collected, to improve sample coverage. The users' actual speaking intervals in the image streams are marked, producing labeled training samples.
The marked training samples can then be input into a neural network model. Inside the model, image stream segments containing a user's image feature sample are extracted from the image stream samples according to the model parameters, and the convolutional neural network algorithm establishes the mapping between those segments and the user's speaking intervals in the image stream. The model's loss function layer computes a loss from the difference between the speaking intervals output by the output layer and the users' actual speaking intervals in the image stream. When the loss function meets the set requirement, the trained video recognition model is obtained.
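A hedged training sketch for this step. The patent describes a convolutional neural network trained against marked speaking intervals; the simplification below scores precomputed per-frame feature vectors with a small feed-forward head and a per-frame binary cross-entropy loss, all of which are assumptions rather than the patent's architecture.

```python
import torch
import torch.nn as nn

class SpeakingIntervalNet(nn.Module):
    """Toy stand-in for the video recognition model: scores each frame-level
    feature vector as 'target user speaking' (1) or not (0)."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 1))

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        return self.head(feats).squeeze(-1)   # per-frame logits: (batch, frames)

def train_step(model, feats, labels, optimizer,
               loss_fn=nn.BCEWithLogitsLoss()):
    """One update on labeled data; `labels` are per-frame 0/1 masks derived
    from the marked speaking intervals."""
    optimizer.zero_grad()
    loss = loss_fn(model(feats), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```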
Within the video recognition model, the image stream may be preprocessed and filtered, for example using pre-labeled questioning actions, guest answers, and recognition of on-site large-screen text.
The text acquisition device also identifies the target user's speaking interval in the audio stream; one realizable way is to use the target user's voiceprint features. Optionally, the audio stream and the target user's voiceprint features are input into an audio recognition model; within the model, audio segments containing the target user's voiceprint features are extracted, and the target user's speaking intervals in the audio stream are identified from those segments.
Before inputting the audio stream and the target user's voiceprint features into the audio recognition model, the text acquisition device trains the model using audio stream samples, users' voiceprint feature samples, and samples of the users' speaking intervals in the audio streams.
In such embodiments, audio stream samples from a variety of scenarios are covered as much as possible when training samples are collected, to improve sample coverage. The users' actual speaking intervals in the audio streams are marked, producing labeled training samples.
The marked training samples can then be input into a neural network model. Inside the model, audio stream segments containing a user's voiceprint feature sample are extracted from the audio stream samples according to the model parameters, and a convolutional neural network algorithm establishes the mapping between those segments and the user's speaking intervals in the audio stream. The model's loss function layer computes a loss from the difference between the speaking intervals output by the output layer and the user's actual speaking intervals in the audio stream. When the loss function meets the set requirement, the trained audio recognition model is obtained.
In this embodiment, a target audio segment containing the speaking content of the target user is intercepted from the audio stream according to the recognized corresponding speaking interval of the target user in the image stream and the audio stream.
Optionally, a model is trained using users' corresponding speaking intervals in the image stream and the audio stream together with the target audio segments containing the users' speech, establishing a mapping between the interval pairs and the target audio segments; this yields the audio clip determination model.
In such an embodiment, corresponding speaking intervals from image streams and audio streams in a variety of scenes are covered as much as possible when training samples are collected, to improve sample coverage. The actual audio segments of the target user's speech are marked, producing labeled training samples.
The marked training samples can then be input into a neural network model. Inside the model, audio segments containing the user's speech are determined from the user's corresponding speaking intervals in the image stream and the audio stream according to the model parameters, and the mapping between the interval pairs and the target audio segments is established. The model's loss function layer computes a loss from the difference between the predicted target audio segments and the actual audio segments of the target user's speech. When the loss function meets the set requirement, the trained audio clip determination model is obtained.
In this embodiment, the text acquisition device performs text conversion on the target audio segment containing the target user's speech, using a text recognition algorithm, to obtain the target user's spoken text content.
In this embodiment, the text acquisition device performs question and answer marking on the text content by inputting it into a pre-trained question and answer marking model, so as to obtain a question and answer text. One way to implement this is to input the text content into the question and answer marking model for question and answer marking to obtain the question and answer text.
In this embodiment, before the text content is input into the question and answer marking model, model training is performed using text content samples and the question and answer text samples obtained by performing question and answer marking on those samples, and a mapping relation between the text content samples and the question and answer text samples is established to obtain the question and answer marking model.
In such an embodiment, when training samples are collected, the text content and the corresponding question and answer text in various scenes are covered as much as possible so as to improve sample coverage. The question and answer text is then labeled to obtain labeled question and answer text for training.
Then, the labeled training samples can be input into a neural network model, and inside the neural network model question and answer marking is performed on the text content samples according to the model parameters. Then, the loss function layer of the neural network model can calculate a loss function from the difference between the marked question and answer text output by the output layer and the actual question and answer text. When the loss function of the neural network model meets the set requirement, the trained question and answer marking model is obtained.
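Since the internals of the question and answer marking model are not disclosed, the sketch below substitutes a simple punctuation heuristic as a stand-in: a sentence ending with a question mark is marked as a question, and the sentences up to the next question are treated as its answer. This is purely illustrative and not the disclosed learned model.

```python
# Heuristic stand-in for the learned question and answer marking model.
import re

def mark_question_answer(text_content):
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?。！？])\s*", text_content)
                 if s.strip()]
    qa_pairs, question, answer = [], None, []
    for sentence in sentences:
        if sentence.endswith(("?", "？")):      # treat as a question
            if question is not None:
                qa_pairs.append({"question": question, "answer": " ".join(answer)})
            question, answer = sentence, []
        elif question is not None:              # sentence answers the open question
            answer.append(sentence)
    if question is not None:
        qa_pairs.append({"question": question, "answer": " ".join(answer)})
    return qa_pairs                             # the question and answer text
```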
In some embodiments, the user may enter question text segments in advance. For example, in a conference scenario, a venue questioning tool is provided to collect audience questions: a question may exist directly in text form as a question text segment, or exist in voice form and be converted into text to obtain a question text segment. After the text acquisition device performs question and answer marking on the text content, it generates candidate question text segments; similarity calculation can then be performed between the generated candidate question text segments and the pre-entered question text segments to determine whether a candidate question text segment can serve as a question text in the question and answer text. One way to implement this is that the text acquisition device judges whether the preset question text segments contain a target question text segment whose similarity to the candidate question text segment is greater than a set similarity threshold; if yes, the target question text segment is used as the question text in the question and answer text; if not, the candidate question text segment is used as the question text in the question and answer text.
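A minimal sketch of this similarity check, using the sequence-matching ratio from Python's standard difflib module as the similarity measure; the actual metric and the 0.8 threshold are assumptions, since the patent leaves both unspecified.

```python
# Stand-in similarity check between candidate and pre-entered question text segments.
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # assumed value

def resolve_question_text(candidate, preset_questions):
    """Return the best pre-entered question if similar enough, else the candidate."""
    best, best_score = None, 0.0
    for preset in preset_questions:
        score = SequenceMatcher(None, candidate, preset).ratio()
        if score > best_score:
            best, best_score = preset, score
    return best if best_score > SIMILARITY_THRESHOLD else candidate
```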
In the method embodiment of the present application, first, a text acquisition device converts a video stream containing question and answer content of a target user into an image stream and an audio stream; then, respectively identifying corresponding speaking intervals of the target user in the image stream and the audio stream; then, fusing the corresponding speaking intervals of the identified target user in the image stream and the audio stream to determine a target speaking interval, intercepting a target audio segment corresponding to the target speaking interval from the audio stream, and improving the identification rate of the audio segment corresponding to the speaking content of the target user by combining the image stream and the audio stream; and finally, performing text conversion on the target audio segment containing the speaking content of the target user to obtain the speaking text content of the target user, wherein the accuracy rate of extracting the text content from the video stream is high.
Fig. 2b is a flowchart of a method of obtaining a text according to an exemplary embodiment of the present application. As shown in fig. 2b, the method comprises the steps of:
s221: converting a video stream containing the question and answer content of the target user into an image stream and an audio stream;
s222: respectively identifying corresponding speaking intervals of a target user in the image stream and the audio stream;
s223: intercepting a target audio clip containing the speaking content of the target user from the audio stream according to the corresponding speaking intervals of the identified target user in the image stream and the audio stream;
s224: performing text conversion on a target audio clip containing the speaking content of the target user to obtain the speaking text content of the target user;
s225: and performing question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may all be the same device, or the methods may be executed by different devices. For example, the execution subject of steps 211 to 213 may be device A; for another example, the execution subject of steps 211 and 212 may be device A, and the execution subject of step 213 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 211, 212, etc., are merely used for distinguishing different operations, and the sequence numbers themselves do not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
The text acquisition method can be applied to application scenarios such as news broadcasting, conference summaries, and court trials. The following description takes specific scenarios as examples. Fig. 3a is a schematic flowchart of a news playing method according to an exemplary embodiment of the present application. As shown in fig. 3a, the method comprises:
s311: collecting a video stream containing the question and answer content of a target user on a news release site;
s312: respectively identifying corresponding speaking intervals of a target user in an image stream and an audio stream in a video stream;
s313: obtaining the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream;
s314: performing question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text;
s315: and sending the question and answer text to the playing terminal so that the playing terminal can display the question and answer text.
In this embodiment, the execution subject of the news playing method is a video recording device on which a video recognition model, an audio clip determination model, and a question and answer marking model are deployed. The video recording device includes an image sensor and an audio sensor for capturing a video stream containing the question and answer content of the target user. The embodiment of the present application does not limit the type of the video recording device, which may be a smart device such as a camera, a robot, a personal computer, a smart television, or a smart speaker. In addition to basic service functions, the video recording device also has computing, communication, and internet access capabilities.
In a news release conference scene, the video recording device records video while the press conference is in progress. In actual operation, to improve recognition and matching, the video recording device is aimed at the speaker's face as much as possible during recording so as to improve the recognition rate. After recording is finished, the video recording device can perform operations such as clipping on the video. The video recording device then converts the video stream containing the question and answer content of the target user into an image stream and an audio stream; recognizes the corresponding speaking intervals of the target user in the image stream and the audio stream respectively; fuses the recognized speaking intervals to determine a target speaking interval and intercepts the target audio segment corresponding to the target speaking interval from the audio stream, improving the recognition rate of the audio segment corresponding to the target user's speaking content by combining the image stream and the audio stream; and performs text conversion on the target audio segment containing the target user's speaking content to obtain the target user's speaking text content, with high accuracy of extracting text content from the video stream. In addition, the video recording device can perform question and answer marking on the text content by inputting it into the pre-trained question and answer marking model to obtain a question and answer text.
After acquiring the question and answer text, the video recording device sends it to the playing terminal. The playing terminal displays the question and answer text for news broadcasting personnel to review, which improves the efficiency of text conversion in news speech scenarios.
In this embodiment, for a specific manner of acquiring the question and answer text by the video recording device, reference may be made to the description of the foregoing embodiments, and details are not repeated here.
Fig. 3b is a flowchart illustrating a live broadcasting method according to an exemplary embodiment of the present application. As shown in fig. 3b, the method comprises:
s321: receiving a video stream which is sent by an intelligent terminal and contains the question and answer content of a target user;
s322: respectively identifying corresponding speaking intervals of a target user in an image stream and an audio stream in a video stream;
s323: obtaining the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream;
s324: performing question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text;
s325: receiving a target question sent by an intelligent terminal, and selecting a target answer corresponding to the target question from a question and answer text;
s326: and sending the target answer to the intelligent terminal so that the user can check the target answer.
In this embodiment, the execution subject of the live broadcasting method may be a server on which a video recognition model, an audio clip determination model, and a question and answer marking model are deployed. The server can be a conventional server, a cloud host, a virtualization center, or the like. The server device mainly comprises a processor, a hard disk, a memory, a system bus, and the like, similar to a general computer architecture, and may include one web server or a plurality of web servers.
In a live shopping scene, the anchor can record the live shopping video during the broadcast using his or her intelligent terminal. After recording is finished, the intelligent terminal can perform operations such as clipping on the video, and then sends the video stream containing the question and answer content of the target user to the server. The server converts the video stream containing the question and answer content of the target user into an image stream and an audio stream; recognizes the corresponding speaking intervals of the target user in the image stream and the audio stream respectively; fuses the recognized speaking intervals to determine a target speaking interval and intercepts the target audio clip corresponding to the target speaking interval from the audio stream, improving the recognition rate of the audio clip corresponding to the target user's speaking content by combining the image stream and the audio stream; and performs text conversion on the target audio clip containing the target user's speaking content to obtain the target user's speaking text content, with high accuracy of extracting text content from the video stream. In addition, the server can perform question and answer marking on the text content by inputting it into the pre-trained question and answer marking model to obtain a question and answer text.
The application does not limit the use of the question and answer text: the platform can retain it and use it to answer user questions automatically in the future. After receiving a target question sent by the intelligent terminal, the server selects a target answer corresponding to the target question from the question and answer text and sends the target answer to the intelligent terminal for the user to view. In a live scene, the question and answer text accumulated in an earlier period thus serves as the source text for automatically answering other users' questions in a later period, which increases the intelligence of the live scene and improves user experience.
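A hedged sketch of this answer-selection step, reusing the same string-similarity idea: the target question is matched against the questions accumulated in the question and answer text, and the paired answer of the best match above a threshold is returned. The matching strategy and the threshold value are assumptions.

```python
# Stand-in answer selection from the accumulated question and answer text.
from difflib import SequenceMatcher

def select_target_answer(target_question, qa_pairs, threshold=0.6):
    """qa_pairs: list of {"question": ..., "answer": ...} built up in earlier sessions."""
    best_pair, best_score = None, 0.0
    for pair in qa_pairs:
        score = SequenceMatcher(None, target_question, pair["question"]).ratio()
        if score > best_score:
            best_pair, best_score = pair, score
    return best_pair["answer"] if best_pair and best_score >= threshold else None
```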
Fig. 4 is a schematic structural diagram of a text acquisition apparatus according to an exemplary embodiment of the present application. As shown in fig. 4, the text acquisition apparatus includes a conversion module 41, a recognition module 42, an intercepting module 43, and a text conversion module 44.
The conversion module 41 is configured to convert a video stream containing the question and answer content of the target user into an image stream and an audio stream;
the recognition module 42 is configured to recognize corresponding speaking intervals of the target user in the image stream and the audio stream respectively;
an intercepting module 43, configured to intercept, from the audio stream, a target audio segment containing the speaking content of the target user according to the identified speaking interval of the target user in the image stream and the audio stream;
and the text conversion module 44 is configured to perform text conversion on the target audio segment containing the speaking content of the target user to obtain the speaking text content of the target user.
The text acquisition apparatus in the embodiment of the present application further includes a labeling module 45, configured to perform question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text.
Optionally, when recognizing the corresponding speaking intervals of the target user in the image stream and the audio stream respectively, the recognition module 42 is specifically configured to: recognize the speaking interval of the target user in the image stream according to the image characteristics of the target user; and recognize the speaking interval of the target user in the audio stream according to the voiceprint characteristics of the target user.
Optionally, when the recognition module 42 recognizes the speaking interval of the target user in the image stream according to the image feature of the target user, it is specifically configured to: inputting the image stream and the image characteristics containing the target user into a video recognition model; extracting an image stream segment containing the image characteristics of the target user according to the image characteristics of the target user in the video identification model; and identifying the speaking interval of the target user in the image stream from the image stream segment containing the image characteristics of the target user by using a convolutional neural network algorithm.
Optionally, the recognition module 42, before inputting the image stream and the image features containing the target user into the video recognition model, may be further configured to: and performing model training by using the image stream sample, the image characteristic sample containing the user and the speaking interval sample of the user in the image stream to obtain a video identification model.
Optionally, the recognition module 42 is specifically configured to, when performing model training by using the image stream sample, the image feature sample including the user, and the speech interval sample of the user in the image stream to obtain the video recognition model: extracting an image stream segment containing an image feature sample of a user from an image stream sample; inputting the image stream segment containing the image characteristic sample of the user and the speaking interval sample of the user in the image stream into a convolutional neural network algorithm, and establishing a mapping relation between the image stream segment containing the image characteristic sample of the user and the speaking interval of the user in the image stream to obtain a video identification model.
Optionally, when recognizing the speaking interval of the target user in the audio stream according to the voiceprint feature of the target user, the recognition module 42 is specifically configured to: inputting the audio stream and the voiceprint characteristics of the target user into an audio recognition model; extracting an audio clip containing the voiceprint characteristics of the target user in the audio recognition model; the speaking interval of the target user in the audio stream is identified from the audio segment containing the voiceprint characteristics of the target user.
Optionally, the recognition module 42, before inputting the audio stream and the voiceprint feature of the target user into the audio recognition model, is further operable to: and performing model training by using the audio stream sample, the voiceprint characteristic sample containing the user and the speaking interval sample of the user in the audio stream to obtain an audio recognition model.
Optionally, when the intercepting module 43 intercepts the target audio segment containing the speaking content of the target user from the audio stream according to the identified speaking interval of the target user in the image stream and the audio stream, the intercepting module is specifically configured to: and inputting the speaking intervals of the target user in the image stream and the audio stream into the audio clip determination model to obtain a target audio clip containing the speaking content of the target user.
Optionally, the intercepting module 43, before inputting the corresponding speaking intervals of the target user in the image stream and the audio stream into the audio segment determination model, may further be configured to: and performing model training by using the corresponding speaking intervals of the user in the image stream and the audio stream and the target audio clip containing the speaking content of the target user, and establishing a mapping relation between the corresponding speaking intervals of the user in the image stream and the audio stream and the target audio clip containing the speaking content of the target user to obtain an audio clip determination model.
Optionally, when performing question and answer marking on the text content by inputting it into the pre-trained question and answer marking model to obtain a question and answer text, the labeling module 45 is specifically configured to: input the text content into the question and answer marking model for question and answer marking to obtain the question and answer text.
Optionally, before inputting the text content into the question and answer marking model, the labeling module 45 may be further configured to: perform model training by using the text content sample and the question and answer text sample obtained after question and answer marking is performed on the text content sample, and establish a mapping relation between the text content sample and the question and answer text sample to obtain the question and answer marking model.
Optionally, after question and answer marking is performed on the text content, the labeling module 45 may be further configured to: generate candidate question text segments; judge whether the preset question text segments contain a target question text segment whose similarity to the candidate question text segment is greater than a set similarity threshold; if yes, use the target question text segment as the question text in the question and answer text; if not, use the candidate question text segment as the question text in the question and answer text.
Optionally, before judging whether the preset question text segments contain a target question text segment whose similarity to the candidate question text segment is greater than the set similarity threshold, the labeling module 45 may be further configured to: acquire question text segments existing in text form; or perform text conversion on question segments existing in voice form to obtain question text segments.
In the above apparatus embodiment of the present application, first, a text acquisition device converts a video stream containing question and answer content of a target user into an image stream and an audio stream; then, respectively identifying corresponding speaking intervals of the target user in the image stream and the audio stream; then, fusing the corresponding speaking intervals of the identified target user in the image stream and the audio stream to determine a target speaking interval, intercepting a target audio segment corresponding to the target speaking interval from the audio stream, and improving the identification rate of the audio segment corresponding to the speaking content of the target user by combining the image stream and the audio stream; and finally, performing text conversion on the target audio segment containing the speaking content of the target user to obtain the speaking text content of the target user, wherein the accuracy rate of extracting the text content from the video stream is high.
Fig. 5 is a schematic structural diagram of a text acquisition device according to an exemplary embodiment of the present application. As shown in fig. 5, the text acquisition device includes a memory 501 and a processor 502. In addition, the text acquisition device also includes necessary components such as a power component 503, a communication component 504, an audio component 505, and a video component 506.
Memory 501 is used to store computer programs and may be configured to store other various data to support operations on the text acquisition device. Examples of such data include instructions for any application or method operating on the text acquisition device.
The memory 501 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The communication component 504 is used for data transmission with other devices.
The processor 502, which may execute computer instructions stored in the memory 501, is configured to: converting a video stream containing the question and answer content of the target user into an image stream and an audio stream; respectively identifying corresponding speaking intervals of a target user in the image stream and the audio stream; intercepting a target audio clip containing the speaking content of the target user from the audio stream according to the corresponding speaking intervals of the identified target user in the image stream and the audio stream; and performing text conversion on the target audio clip containing the speaking content of the target user to obtain the speaking text content of the target user.
Optionally, after obtaining the speaking text content of the target user, the processor 502 may be further configured to: perform question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text.
Optionally, when recognizing the speaking intervals of the target user in the image stream and the audio stream respectively, the processor 502 is specifically configured to: recognizing a speaking interval of a target user in an image stream according to the image characteristics of the target user; and recognizing the speaking interval of the target user in the audio stream according to the voiceprint characteristics of the target user.
Optionally, when the processor 502 identifies the speaking interval of the target user in the image stream according to the image feature of the target user, it is specifically configured to: inputting the image stream and the image characteristics containing the target user into a video recognition model; extracting an image stream segment containing the image characteristics of the target user according to the image characteristics of the target user in the video identification model; and identifying the speaking interval of the target user in the image stream from the image stream segment containing the image characteristics of the target user by using a convolutional neural network algorithm.
Optionally, the processor 502, before inputting the image stream and the image features containing the target user into the video recognition model, may be further configured to: and performing model training by using the image stream sample, the image characteristic sample containing the user and the speaking interval sample of the user in the image stream to obtain a video identification model.
Optionally, when performing model training by using the image stream sample, the image feature sample including the user, and the speech interval sample of the user in the image stream to obtain the video recognition model, the processor 502 is specifically configured to: extracting an image stream segment containing an image feature sample of a user from an image stream sample; inputting the image stream segment containing the image characteristic sample of the user and the speaking interval sample of the user in the image stream into a convolutional neural network algorithm, and establishing a mapping relation between the image stream segment containing the image characteristic sample of the user and the speaking interval of the user in the image stream to obtain a video identification model.
Optionally, when recognizing the speaking interval of the target user in the audio stream according to the voiceprint feature of the target user, the processor 502 is specifically configured to: inputting the audio stream and the voiceprint characteristics of the target user into an audio recognition model; extracting an audio clip containing the voiceprint characteristics of the target user in the audio recognition model; the speaking interval of the target user in the audio stream is identified from the audio segment containing the voiceprint characteristics of the target user.
Optionally, before the processor 502 inputs the audio stream and the voiceprint feature of the target user into the audio recognition model, the method further comprises: and performing model training by using the audio stream sample, the voiceprint characteristic sample containing the user and the speaking interval sample of the user in the audio stream to obtain an audio recognition model.
Optionally, when the processor 502 intercepts a target audio segment containing the speaking content of the target user from the audio stream according to the identified speaking interval of the target user in the image stream and the audio stream, the processor is specifically configured to: and inputting the speaking intervals of the target user in the image stream and the audio stream into the audio clip determination model to obtain a target audio clip containing the speaking content of the target user.
Optionally, the processor 502 may be further configured to, before inputting the corresponding speaking intervals of the target user in the image stream and the audio stream into the audio segment determination model: and performing model training by using the corresponding speaking intervals of the user in the image stream and the audio stream and the target audio clip containing the speaking content of the target user, and establishing a mapping relation between the corresponding speaking intervals of the user in the image stream and the audio stream and the target audio clip containing the speaking content of the target user to obtain an audio clip determination model.
Optionally, when performing question and answer marking on the text content by inputting it into the pre-trained question and answer marking model to obtain a question and answer text, the processor 502 is specifically configured to: input the text content into the question and answer marking model for question and answer marking to obtain the question and answer text.
Optionally, before inputting the text content into the question and answer marking model, the processor 502 may be further configured to: perform model training by using the text content sample and the question and answer text sample obtained after question and answer marking is performed on the text content sample, and establish a mapping relation between the text content sample and the question and answer text sample to obtain the question and answer marking model.
Optionally, after question and answer marking is performed on the text content, the processor 502 is further configured to: generate candidate question text segments; judge whether the preset question text segments contain a target question text segment whose similarity to the candidate question text segment is greater than a set similarity threshold; if yes, use the target question text segment as the question text in the question and answer text; if not, use the candidate question text segment as the question text in the question and answer text.
Optionally, before judging whether the preset question text segments contain a target question text segment whose similarity to the candidate question text segment is greater than the set similarity threshold, the processor 502 may be further configured to: acquire question text segments existing in text form; or perform text conversion on question segments existing in voice form to obtain question text segments.
Correspondingly, an embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps in the method embodiment of fig. 2 a.
Fig. 6 is a schematic structural diagram of a video capture device according to an exemplary embodiment of the present application. As shown in fig. 6, the video capture device includes a memory 601 and a processor 602. In addition, the video capture device includes the necessary components of a power component 603, a communication component 604, an audio component 605, and a video component 606.
The memory 601 is used for storing computer programs and may be configured to store other various data to support operations on the video capture device. Examples of such data include instructions for any application or method operating on the video capture device.
The memory 601 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The communication component 604 is used for data transmission with other devices.
The processor 602 may execute the computer instructions stored in the memory 601 to: collect a video stream containing the question and answer content of a target user at a news release site; respectively identify the corresponding speaking intervals of the target user in the image stream and the audio stream of the video stream; obtain the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream; perform question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text; and send the question and answer text to the playing terminal so that the playing terminal can display the question and answer text.
Correspondingly, an embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps in the method embodiment of fig. 3 a.
Fig. 7 is a schematic structural diagram of a server according to an exemplary embodiment of the present application. As shown in fig. 7, the server includes a memory 701 and a processor 702. In addition, the server includes necessary components such as a power component 703 and a communication component 704.
A memory 701 for storing a computer program and may be configured to store other various data to support operations on the server. Examples of such data include instructions for any application or method operating on the server.
The memory 701 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The communication component 704 is used for data transmission with other devices.
The processor 702 may execute the computer instructions stored in the memory 701 to: receive a video stream, sent by an intelligent terminal, containing the question and answer content of a target user; respectively identify the corresponding speaking intervals of the target user in the image stream and the audio stream of the video stream; obtain the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream; perform question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text; receive a target question sent by the intelligent terminal and select a target answer corresponding to the target question from the question and answer text; and send the target answer to the intelligent terminal so that the user can view the target answer.
Correspondingly, an embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps in the method embodiment of fig. 3 b.
In the above device embodiment of the present application, first, the text acquisition device converts a video stream containing the question and answer content of the target user into an image stream and an audio stream; then, respectively identifying corresponding speaking intervals of the target user in the image stream and the audio stream; then, fusing the corresponding speaking intervals of the identified target user in the image stream and the audio stream to determine a target speaking interval, intercepting a target audio segment corresponding to the target speaking interval from the audio stream, and improving the identification rate of the audio segment corresponding to the speaking content of the target user by combining the image stream and the audio stream; and finally, performing text conversion on the target audio segment containing the speaking content of the target user to obtain the speaking text content of the target user, wherein the accuracy rate of extracting the text content from the video stream is high.
The communication components of figs. 5-7 described above are configured to facilitate communication between the device where the communication component is located and other devices in a wired or wireless manner. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, or a 2G, 3G, 4G/LTE, or 5G mobile communication network, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In another exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The audio components of figs. 5-7 described above may be configured to output and/or input audio signals. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device where the audio component is located is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may further be stored in the memory or transmitted via the communication component. In some embodiments, the audio component further comprises a speaker for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (19)

1. A text acquisition method, comprising:
converting a video stream containing the question and answer content of the target user into an image stream and an audio stream;
respectively identifying corresponding speaking intervals of a target user in the image stream and the audio stream;
intercepting a target audio clip containing the speaking content of the target user from the audio stream according to the corresponding speaking intervals of the identified target user in the image stream and the audio stream;
and performing text conversion on the target audio clip containing the speaking content of the target user to obtain the speaking text content of the target user.
2. The method of claim 1, wherein after obtaining the textual content of the target user utterance, the method further comprises:
and performing question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text.
3. The method of claim 1, wherein identifying the corresponding speaking intervals of the target user in the image stream and the audio stream respectively comprises:
recognizing a speaking interval of a target user in an image stream according to the image characteristics of the target user;
and recognizing the speaking interval of the target user in the audio stream according to the voiceprint characteristics of the target user.
4. The method of claim 3, wherein identifying the speaking interval of the target user in the image stream according to the image characteristics of the target user comprises:
inputting the image stream and the image characteristics containing the target user into a video recognition model;
extracting an image stream segment containing the image characteristics of a target user according to the image characteristics of the target user in the video identification model;
and identifying the speaking interval of the target user in the image stream from the image stream segment containing the image characteristics of the target user by using a convolutional neural network algorithm.
5. The method of claim 4, wherein prior to inputting the image stream and the image features comprising the target user into the video recognition model, the method further comprises:
and performing model training by using the image stream sample, the image characteristic sample containing the user and the speaking interval sample of the user in the image stream to obtain a video identification model.
6. The method of claim 5, wherein performing model training using the image stream samples, the image feature samples including the user, and the speech interval samples of the user in the image stream to obtain the video recognition model comprises:
extracting an image stream segment containing an image feature sample of a user from the image stream sample;
inputting the image stream segment containing the image characteristic sample of the user and the speaking interval sample of the user in the image stream into a convolutional neural network algorithm, and establishing a mapping relation between the image stream segment containing the image characteristic sample of the user and the speaking interval of the user in the image stream to obtain a video identification model.
7. The method of claim 3, wherein identifying the speaking interval of the target user in the audio stream according to the voiceprint characteristics of the target user comprises:
inputting the audio stream and the voiceprint characteristics of the target user into an audio recognition model;
extracting an audio clip containing the voiceprint characteristics of the target user in the audio recognition model;
and identifying the speaking interval of the target user in the audio stream from the audio segment containing the voiceprint characteristics of the target user.
8. The method of claim 7, wherein prior to inputting the audio stream and the voiceprint characteristics of the target user into an audio recognition model, the method further comprises:
and performing model training by using the audio stream sample, the voiceprint characteristic sample containing the user and the speaking interval sample of the user in the audio stream to obtain an audio recognition model.
9. The method of claim 1, wherein intercepting a target audio segment containing the speaking content of the target user from the audio stream according to the corresponding speaking interval of the identified target user in the image stream and the audio stream comprises:
and inputting the speaking intervals of the target user in the image stream and the audio stream into an audio clip determination model to obtain a target audio clip containing the speaking content of the target user.
10. The method of claim 9, wherein prior to inputting the corresponding speaking intervals of the target user in the image stream and the audio stream into the audio clip determination model, the method further comprises:
and performing model training by using the corresponding speaking intervals of the user in the image stream and the audio stream and the target audio clip containing the speaking content of the target user, and establishing a mapping relation between the corresponding speaking intervals of the user in the image stream and the audio stream and the target audio clip containing the speaking content of the target user to obtain an audio clip determination model.
11. The method according to claim 2, wherein the step of performing question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text comprises:
and inputting the text content into a question-answer marking model for question-answer marking to obtain a question-answer text.
12. The method of claim 11, wherein before inputting the text content into the question and answer marking model, the method further comprises:
and performing model training by using the text content sample and the question and answer text sample after the question and answer marking is performed on the text content sample, and establishing a mapping relation between the text content sample and the question and answer text sample to obtain a question and answer marking model.
13. The method of claim 12, wherein after question and answer marking is performed on the text content, the method further comprises:
generating candidate question text segments;
judging whether the preset question text segments contain a target question text segment whose similarity to the candidate question text segment is greater than a set similarity threshold;
if yes, using the target question text segment as a question text in the question and answer text;
and if not, using the candidate question text segment as a question text in the question and answer text.
14. The method according to claim 13, wherein before judging whether the preset question text segments contain a target question text segment whose similarity to the candidate question text segment is greater than the set similarity threshold, the method further comprises:
acquiring a question text segment existing in text form;
or performing text conversion on a question segment existing in voice form to obtain a question text segment.
15. An information processing apparatus characterized by comprising: one or more processors and one or more memories storing computer programs;
the one or more processors to execute the computer program to:
converting a video stream containing the question and answer content of the target user into an image stream and an audio stream;
respectively identifying corresponding speaking intervals of a target user in the image stream and the audio stream;
intercepting a target audio clip containing the speaking content of the target user from the audio stream according to the corresponding speaking intervals of the identified target user in the image stream and the audio stream;
and performing text conversion on the target audio clip containing the speaking content of the target user to obtain the speaking text content of the target user.
16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform acts comprising:
converting a video stream containing the question and answer content of the target user into an image stream and an audio stream;
respectively identifying corresponding speaking intervals of a target user in the image stream and the audio stream;
intercepting a target audio clip containing the speaking content of the target user from the audio stream according to the corresponding speaking intervals of the identified target user in the image stream and the audio stream;
and performing text conversion on the target audio clip containing the speaking content of the target user to obtain the speaking text content of the target user.
17. A live broadcast method, comprising:
receiving a video stream which is sent by an intelligent terminal and contains the question and answer content of a target user;
respectively identifying corresponding speaking intervals of a target user in an image stream and an audio stream in a video stream;
obtaining the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream;
performing question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text;
receiving a target question sent by an intelligent terminal, and selecting a target answer corresponding to the target question from a question and answer text;
and sending the target answer to an intelligent terminal so that a user can check the target answer.
18. A server, comprising: one or more processors and one or more memories storing computer programs;
the one or more processors to execute the computer program to:
receiving a video stream which is sent by an intelligent terminal and contains the question and answer content of a target user;
respectively identifying corresponding speaking intervals of a target user in an image stream and an audio stream in a video stream;
obtaining the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream;
performing question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text;
receiving a target question sent by an intelligent terminal, and selecting a target answer corresponding to the target question from a question and answer text;
and sending the target answer to an intelligent terminal so that a user can check the target answer.
19. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform acts comprising:
receiving a video stream which is sent by an intelligent terminal and contains the question and answer content of a target user;
respectively identifying corresponding speaking intervals of a target user in an image stream and an audio stream in a video stream;
obtaining the speaking text content of the target user according to the recognized corresponding speaking intervals of the target user in the image stream and the audio stream;
performing question and answer marking on the text content by inputting it into a pre-trained question and answer marking model to obtain a question and answer text;
receiving a target question sent by an intelligent terminal, and selecting a target answer corresponding to the target question from a question and answer text;
and sending the target answer to an intelligent terminal so that a user can check the target answer.
CN202010506291.7A 2020-06-05 2020-06-05 Text acquisition method, text live broadcast equipment and storage medium Pending CN113761986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010506291.7A CN113761986A (en) 2020-06-05 2020-06-05 Text acquisition method, text live broadcast equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010506291.7A CN113761986A (en) 2020-06-05 2020-06-05 Text acquisition method, text live broadcast equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113761986A true CN113761986A (en) 2021-12-07

Family

ID=78785003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010506291.7A Pending CN113761986A (en) 2020-06-05 2020-06-05 Text acquisition method, text live broadcast equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113761986A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114005079A (en) * 2021-12-31 2022-02-01 北京金茂教育科技有限公司 Multimedia stream processing method and device
CN114299952A (en) * 2021-12-29 2022-04-08 湖北微模式科技发展有限公司 Speaker role distinguishing method and system combining multiple motion analysis

Similar Documents

Publication Publication Date Title
CN109726624B (en) Identity authentication method, terminal device and computer readable storage medium
CN110517689B (en) Voice data processing method, device and storage medium
WO2020119448A1 (en) Voice information verification
US10529340B2 (en) Voiceprint registration method, server and storage medium
CN107463700B (en) Method, device and equipment for acquiring information
CN113163272B (en) Video editing method, computer device and storage medium
CN111477250A (en) Audio scene recognition method, and training method and device of audio scene recognition model
CN110970014A (en) Voice conversion, file generation, broadcast, voice processing method, device and medium
CN111339806B (en) Training method of lip language recognition model, living body recognition method and device
CN108762507B (en) Image tracking method and device
CN109597943B (en) Learning content recommendation method based on scene and learning equipment
CN109271533A Multimedia document retrieval method
CN109582825B (en) Method and apparatus for generating information
CN112053692B (en) Speech recognition processing method, device and storage medium
CN113761986A (en) Text acquisition method, text live broadcast equipment and storage medium
CN113315979A (en) Data processing method and device, electronic equipment and storage medium
CN111656275B (en) Method and device for determining image focusing area
CN110211590A Method, apparatus, terminal device and storage medium for processing conference hot spots
CN111144360A (en) Multimode information identification method and device, storage medium and electronic equipment
CN109829691B (en) C/S card punching method and device based on position and deep learning multiple biological features
CN112599130B (en) Intelligent conference system based on intelligent screen
CN110874554B (en) Action recognition method, terminal device, server, system and storage medium
CN111161710A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
EP4276827A1 (en) Speech similarity determination method, device and program product
CN115331703A (en) Song voice detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination