CN107578777B - Text information display method, device and system, and voice recognition method and device

Publication number: CN107578777B (granted from application CN201610523339.9A; earlier publication CN107578777A)
Authority: CN (China)
Inventors: 高杰, 周躜
Assignee: Alibaba Group Holding Ltd
Legal status: Active
Classifications: User Interface Of Digital Computer; Telephonic Communication Services
Abstract

A text information display method, device, and system, and a voice recognition method and device are provided. The text information display method presets a correspondence between source identifiers and terminals, and includes the following steps: performing voice recognition separately on collected multiple channels of voice data from different sources; determining the text information and source identifier corresponding to each channel of voice data; and, according to the correspondence, displaying the text information corresponding to each channel of voice data in the one or more terminals corresponding to that channel's source identifier. The method and device can quickly and accurately display, in a terminal, the text information corresponding to voice data from one or more particular sources.

Description

Text information display method, device and system, and voice recognition method and device
Technical Field
The present invention relates to the field of multimedia, and in particular to a text information display method, device, and system, and to a voice recognition method and device.
Background
Speech recognition is a technique that uses machine learning methods to convert speech data into corresponding text information. Through speech recognition, speech can be presented in text form when transferring voice messages and during conversations, conferences, and the like.
The schemes in the related art have at least the following disadvantage:
when multiple parties participate in a conversation, if the text information converted from the speech of one or more particular users needs to be displayed in different terminals, this is usually done through manual transcription and sorting. Manual transcription, however, takes a long time and cannot meet real-time requirements, and, because human ability to distinguish voices is limited, errors may occur when attributing speech to different users.
Disclosure of Invention
The present application provides a text information display method, device, and system, and a voice recognition method and device, which can quickly and accurately display, in a terminal, the text information corresponding to voice data from one or more particular sources.
The technical solutions are as follows.
A text information display method, in which a correspondence between source identifiers and terminals is preset, includes the following steps:
performing voice recognition separately on collected multiple channels of voice data from different sources;
determining the text information and source identifier corresponding to each channel of voice data;
and, according to the correspondence, displaying the text information corresponding to each channel of voice data in the one or more terminals corresponding to that channel's source identifier.
Displaying the text information corresponding to each channel of voice data in the one or more terminals corresponding to that channel's source identifier may include:
converting the text information corresponding to each channel into subtitles, and displaying the subtitles in the one or more terminals corresponding to that channel's source identifier.
Converting the text information corresponding to each channel into subtitles may include:
among the text information corresponding to the channels of voice data, converting into subtitles the text information corresponding to voice data whose source identifier is a predetermined source identifier.
The method may further include:
determining the correspondence between source identifiers and terminals according to the video played by each terminal.
Determining the source identifier corresponding to each channel of voice data may include:
determining the source identifier of each channel according to the acquisition end that collected it; or determining the source identifier of each channel by performing voice recognition on that channel.
Performing voice recognition separately on the collected channels of voice data from different sources may include:
sending the collected channels of voice data to a cloud for voice recognition, and receiving from the cloud the text information obtained by recognizing each channel.
After determining the text information and source identifier corresponding to each channel, the method may further include:
storing the text information and source identifier corresponding to each channel, in correspondence, in a database.
Displaying, according to the correspondence, the text information corresponding to each channel in the one or more terminals corresponding to that channel's source identifier may then include:
determining the source identifiers corresponding to different terminals according to the correspondence;
and obtaining from the database the text information corresponding to each terminal's source identifiers, and displaying it in the corresponding terminals.
A speech recognition method includes:
performing voice recognition on one or more collected channels of voice data;
and determining the text information and source identifier corresponding to each channel.
Performing voice recognition on the collected channels may include:
sending the collected channels of voice data to a cloud for voice recognition, and receiving from the cloud the text information obtained by recognizing each channel.
After determining the text information and source identifier corresponding to each channel, the method may further include:
storing the text information and source identifier corresponding to each channel, in correspondence, in a database.
A text information display method includes:
obtaining text information corresponding to one or more source identifiers, the text information corresponding to a source identifier being obtained by performing voice recognition on the voice data corresponding to that source identifier;
and displaying the text information corresponding to the one or more source identifiers in the terminals corresponding to those source identifiers, according to a preset correspondence between source identifiers and terminals.
Displaying the text information corresponding to the one or more source identifiers in the corresponding terminals may include:
converting the text information corresponding to the one or more source identifiers into subtitles, and displaying the subtitles in the terminals corresponding to those source identifiers.
Obtaining the text information corresponding to one or more source identifiers may include:
obtaining it from a database in which source identifiers and text information are stored in correspondence.
Obtaining the text information corresponding to one or more source identifiers may also include:
obtaining the text information corresponding to one or more predetermined source identifiers according to the preset correspondence between source identifiers and terminals.
A text information transmission method, in which text information and source identifiers that correspond to each other are stored in advance, the text information corresponding to a source identifier being obtained by performing voice recognition on the voice data corresponding to that source identifier, includes the following steps:
determining the source identifiers corresponding to different terminals according to a preset correspondence between source identifiers and terminals;
and sending to each terminal the text information corresponding to that terminal's source identifiers.
A text information display system includes:
one or more speech recognition devices and one or more text information display devices;
at least one speech recognition device configured to perform voice recognition on one or more collected channels of voice data and determine the text information and source identifier corresponding to each channel;
and at least one text information display device configured to display, according to a preset correspondence between source identifiers and terminals, the text information corresponding to the one or more channels of voice data in the terminals corresponding to each channel's source identifier.
Displaying the text information corresponding to the one or more channels in the terminals corresponding to their source identifiers may include:
converting the text information corresponding to the one or more channels into subtitles, and displaying the subtitles in the terminals corresponding to their source identifiers.
The speech recognition device performing voice recognition on the collected channels may include:
sending the collected channels of voice data to a cloud for voice recognition, and receiving from the cloud the text information obtained by recognizing each channel.
At least one speech recognition device may be further configured to store the text information and source identifier corresponding to each channel, in correspondence, in a database;
at least one text information display device may be further configured to obtain the text information corresponding to one or more channels from the database.
The text information display device obtaining the text information corresponding to one or more channels from the database may include:
obtaining from the database the text information corresponding to one or more predetermined source identifiers, according to the preset correspondence between source identifiers and terminals.
A speech recognition apparatus includes a memory and a processor;
the memory stores a program for voice recognition which, when read and executed by the processor, performs the following operations:
performing voice recognition on one or more collected channels of voice data;
and determining the text information and source identifier corresponding to each channel.
A speech acquisition device includes a microphone, a memory, and a processor;
the microphone collects voice data;
the memory stores a program for voice recognition which, when read and executed by the processor, performs the following operations:
performing voice recognition on the voice data collected by the microphone;
and determining the text information and source identifier corresponding to the voice data.
A text information processing apparatus includes a memory and a processor;
the memory stores a program for text information display which, when read and executed by the processor, performs the following operations:
obtaining text information corresponding to one or more source identifiers, the text information corresponding to a source identifier being obtained by performing voice recognition on the voice data corresponding to that source identifier;
and displaying the text information corresponding to the one or more source identifiers in the terminals corresponding to those source identifiers, according to a preset correspondence between source identifiers and terminals.
A text information display apparatus includes a display screen, a memory, and a processor;
the memory stores a program for text information display which, when read and executed by the processor, performs the following operations:
obtaining text information corresponding to a predetermined source identifier, the text information being obtained by performing voice recognition on the voice data corresponding to that source identifier;
and displaying the obtained text information on the display screen.
A text information display system, in which a correspondence between source identifiers and terminals is preset, includes:
a voice recognition module configured to perform voice recognition separately on collected multiple channels of voice data from different sources;
a determining module configured to determine the text information and source identifier corresponding to each channel of voice data;
and a display module configured to display, according to the correspondence, the text information corresponding to each channel in the one or more terminals corresponding to that channel's source identifier.
The present application includes the following advantages:
According to at least one embodiment of the application, text information obtained by performing voice recognition on voice data from different sources can be displayed in the corresponding terminals according to the source identifiers of the voice data. Because the text information can be distinguished by the source of the collected voice data, the text information corresponding to voice data from one or more particular sources can be displayed in a terminal quickly and accurately.
Of course, practicing the present application does not require any single product to achieve all of the advantages described above at the same time.
Drawings
Fig. 1 is a flowchart of the text information display method of embodiment 1;
Fig. 2 is a diagram of a system architecture applying embodiment 1;
Fig. 3 is a flowchart of the speech recognition method of embodiment 2;
Fig. 4 is a flowchart of the text information display method of embodiment 3;
Fig. 5 is a flowchart of the text information transmission method of embodiment 4;
Fig. 6 is a schematic view of the text information display system of embodiment 5;
Fig. 7 is a schematic view of the text information display system of embodiment 10;
Fig. 8 is a schematic diagram of the subtitle appending system of embodiment 15;
Fig. 9 is a schematic illustration of one implementation of embodiment 15;
Fig. 10 is a schematic diagram of a second implementation of embodiment 15.
Detailed Description
The technical solutions of the present application will be described in more detail below with reference to the accompanying drawings and embodiments.
It should be noted that, provided they do not conflict, the embodiments and the features of the embodiments may be combined with each other, and such combinations fall within the scope of protection of the present application. Additionally, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
Herein, the memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM).
Computer-readable media include permanent and non-permanent, removable and non-removable storage media. A storage medium may implement information storage by any method or technology; the information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.
Embodiment 1
A text information display method, in which a correspondence between source identifiers and terminals is preset. As shown in Fig. 1, the method includes steps S110 to S130.
S110: performing voice recognition separately on collected multiple channels of voice data from different sources;
S120: determining the text information and source identifier corresponding to each channel of voice data;
S130: according to the correspondence, displaying the text information corresponding to each channel of voice data in the one or more terminals corresponding to that channel's source identifier.
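To make the flow of steps S110 to S130 concrete, the following is a minimal Python sketch. The recognizer, the display call, and the source-to-terminal map are hypothetical stand-ins for this illustration; none of them is an API specified by the patent.

```python
# Minimal sketch of steps S110-S130. The recognize and show callables and the
# SOURCE_TO_TERMINALS map are hypothetical stand-ins, not APIs from the patent.
from typing import Callable

# Preset correspondence between source identifiers and terminals (input to S130).
SOURCE_TO_TERMINALS = {
    "user_a": ["terminal_1"],
    "user_b": ["terminal_1", "terminal_2"],
}

def display_recognized_text(
    channels: dict[str, bytes],             # source identifier -> voice data
    recognize: Callable[[bytes], str],      # S110: speech recognition
    show: Callable[[str, str], None],       # sends one text item to one terminal
) -> None:
    for source_id, voice_data in channels.items():
        text = recognize(voice_data)        # S110/S120: text + source identifier
        for terminal in SOURCE_TO_TERMINALS.get(source_id, []):   # S130
            show(terminal, f"[{source_id}] {text}")
```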
A system architecture applying the method of this embodiment is shown in Fig. 2. The system may include one or more recognition-end devices 21 and one or more display-end devices 22.
A recognition-end device 21 performs speech recognition on the multiple channels of voice data from different sources collected by the voice collecting devices, and determines the text information and source identifier corresponding to each channel.
Performing voice recognition separately on the collected channels of voice data from different sources may include:
sending each collected channel of voice data to a cloud for voice recognition, and receiving from the cloud the text information obtained by recognizing each channel.
The cloud may be an independently deployed server or server cluster with voice recognition capability; specifically, it may be a server or server cluster operating as a cloud computing service.
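As an illustration of the cloud round trip, here is a hedged Python sketch; the endpoint URL, headers, and response field are assumptions chosen for the example, since the patent does not specify a particular cloud API.

```python
# Hedged sketch of offloading recognition to a cloud service. The endpoint,
# request headers, and response shape below are illustrative assumptions.
import requests

CLOUD_ASR_URL = "https://asr.example.com/v1/recognize"  # hypothetical endpoint

def recognize_in_cloud(voice_data: bytes) -> str:
    resp = requests.post(
        CLOUD_ASR_URL,
        data=voice_data,                          # one channel of voice data
        headers={"Content-Type": "audio/wav"},    # assumed audio format
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["text"]                    # assumed response field
```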
After determining the text information and source identifier corresponding to each channel of voice data, the method may further include:
storing the text information and source identifier corresponding to each channel, in correspondence, in a database. Storing the recognized text information and source identifiers in a database reduces the storage burden on the recognition devices, and also suits the system architecture shown in Fig. 2.
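One way to store the (source identifier, text information) pairs is sketched below with SQLite; the schema and the use of a collection timestamp are assumptions for illustration, since the patent does not prescribe a database product.

```python
# Illustrative persistence of (source identifier, text) pairs; schema assumed.
import sqlite3
import time

conn = sqlite3.connect("recognized_text.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS text_info (
           source_id   TEXT NOT NULL,
           text        TEXT NOT NULL,
           captured_at REAL NOT NULL   -- collection time, used for ordering
       )"""
)

def store_text(source_id: str, text: str) -> None:
    conn.execute(
        "INSERT INTO text_info (source_id, text, captured_at) VALUES (?, ?, ?)",
        (source_id, text, time.time()),
    )
    conn.commit()
```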
The display-end device 22 may, according to the above correspondence, display the text information corresponding to each channel of voice data in the one or more terminals corresponding to that channel's source identifier. The display-end device may present text information on a terminal in several ways: it may actively obtain all text information and the corresponding source identifiers from the database and display the relevant text on the terminal based on the correspondence; it may actively obtain from the database only the text information to be displayed, as determined by the correspondence; the database may push all text information to the display-end device, which then performs the operations above; or the database may push specific text information to each display-end device according to the correspondence. Different application scenarios may call for different processing modes.
In one implementation, displaying, according to the correspondence, the text information corresponding to each channel of voice data in the one or more terminals corresponding to that channel's source identifier may specifically include:
determining the source identifiers corresponding to different terminals according to the correspondence;
and obtaining from the database the text information corresponding to each terminal's source identifiers, and displaying it in the corresponding terminals.
When text information is obtained from the database, it may be retrieved in order of the collection time of the corresponding voice data, or in descending order of the priority of the corresponding source identifier.
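Continuing the SQLite schema assumed above, the per-terminal fetch and the two ordering options just described might look like this; the priority table is a made-up example.

```python
# Sketch of fetching a terminal's text from the database, continuing the
# text_info schema assumed above. SOURCE_PRIORITY is a hypothetical example.
SOURCE_PRIORITY = {"judge": 0, "plaintiff": 1, "defendant": 1}

def fetch_for_terminal(conn, terminal: str,
                       correspondence: dict[str, list[str]]) -> list[str]:
    source_ids = correspondence.get(terminal, [])
    if not source_ids:
        return []
    placeholders = ",".join("?" * len(source_ids))
    rows = conn.execute(
        f"SELECT source_id, text, captured_at FROM text_info "
        f"WHERE source_id IN ({placeholders})",
        source_ids,
    ).fetchall()
    rows.sort(key=lambda r: r[2])                    # by collection time, or:
    # rows.sort(key=lambda r: SOURCE_PRIORITY.get(r[0], 99))  # by priority
    return [f"[{sid}] {text}" for sid, text, _ in rows]
```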
In the system architecture shown in Fig. 2, the multiple channels of voice data from different sources may be voice data collected by different voice collecting devices; for example, the voice data collected by each microphone is treated as one channel, or the voice data collected by each microphone set is treated as one channel (a microphone set includes one or more microphones). The multiple channels may also be voice data collected by the same voice collecting device but belonging to different users; for example, when multiple users speak into the same microphone, the voice data collected by that microphone can be divided into multiple channels from different sources through voiceprint recognition. Here the source identifier may refer to an individual speaker. In other scenarios, the source identifier may refer to a class or group of users; for example, in a multiparty conference each participating party may include several persons, and the source identifier may then refer to a participating party rather than to an individual participant.
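The voiceprint-based splitting just described might be structured as below; segment_audio and identify_speaker are hypothetical stand-ins for a segmentation and voiceprint-matching component, which the patent does not name.

```python
# Hedged sketch: split one microphone's audio into per-speaker channels.
# segment_audio and identify_speaker are hypothetical diarization hooks.
from collections import defaultdict
from typing import Callable, Iterable

def split_by_voiceprint(
    audio: bytes,
    segment_audio: Callable[[bytes], Iterable[bytes]],  # e.g. pause-based cuts
    identify_speaker: Callable[[bytes], str],           # voiceprint -> source id
) -> dict[str, list[bytes]]:
    channels: dict[str, list[bytes]] = defaultdict(list)
    for segment in segment_audio(audio):
        channels[identify_speaker(segment)].append(segment)
    return dict(channels)
```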
In the system architecture shown in Fig. 2, the voice collecting device and the recognition-end device 21 may be different devices; for example, a microphone externally connected to the recognition-end device 21 serves as the voice collecting device. They may also be the same device, such as a smartphone or tablet computer that has a microphone and can perform the functions of the recognition-end device 21.
Likewise, the terminal and the display-end device 22 may be different devices; for example, a display screen externally connected to the display-end device 22 serves as the terminal. They may also be the same device, such as a smartphone or tablet computer that has a display screen and can perform the functions of the display-end device 22.
In the system architecture shown in Fig. 2, each recognition-end device 21 may perform speech recognition on only one channel of voice data, or may perform speech recognition on multiple channels by running multiple recognition threads.
Each display-end device 22 may correspond to one terminal, obtaining the text information matching that terminal's source identifiers from the database and displaying it in the terminal. A display-end device 22 may also correspond to multiple terminals, running multiple text information display threads to obtain and display, for each terminal, the text information matching that terminal's source identifiers.
In an embodiment, displaying the text information corresponding to each channel of voice data in the one or more terminals corresponding to that channel's source identifier may specifically include:
converting the text information corresponding to each channel into subtitles, and displaying the subtitles in the one or more terminals corresponding to that channel's source identifier. Converting recognized text information into subtitles specifically means converting it into file information in a subtitle format, so that it can be composited with the video displayed on the terminal. Information related to the source identifier may be added to the generated subtitles; for example, when the source identifier corresponds to a particular speaker, that speaker's name may be added to the subtitles so that other users can tell which words were spoken by that speaker, and when the source identifier refers to a device or a group of users, corresponding information may likewise be added to the subtitles.
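For concreteness, here is a sketch that turns one recognized utterance into a subtitle cue; SRT is used purely as a familiar subtitle file format, and the speaker prefix mirrors the source-identifier information described above. The patent itself does not mandate a format.

```python
# Sketch: convert recognized text into one subtitle cue (SRT format assumed).
def to_srt_cue(index: int, start_s: float, end_s: float,
               source_id: str, text: str) -> str:
    def ts(t: float) -> str:
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d},{round((t % 1) * 1000):03d}"
    # Prefix the source identifier so viewers can tell whose words these are.
    return f"{index}\n{ts(start_s)} --> {ts(end_s)}\n[{source_id}] {text}\n"

print(to_srt_cue(1, 3.2, 6.8, "user_a", "Hello, everyone."))
# 1
# 00:00:03,200 --> 00:00:06,800
# [user_a] Hello, everyone.
```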
In a live broadcast scenario, subtitles can be matched to video during the live broadcast, for example at conferences, court trials, and press conferences: after a user's voice data is collected, text information is obtained through real-time voice recognition and converted into subtitles, and the subtitles are displayed in the terminal together with the video data collected for that user. Since the delay between the generation of the voice data and the final display of the subtitles can be short enough that viewers cannot perceive it, viewers perceive a live video containing subtitles.
The manner in which subtitles are displayed with the video data may include, but is not limited to: generating subtitle images with a pure green background and compositing them with the frames of the video data (i.e., chroma keying); or generating an external subtitle file that is loaded when the video data is played. In some application scenarios, only the text information may be displayed, without video. Text information can also be displayed in a manner similar to scrolling on-screen comments ("bullet screens") during video playback.
In some embodiments, the subtitles and the video may be composited by the terminal itself, or composited by another device and then displayed by the terminal.
In some embodiments, subtitle conversion and composition may be implemented by the same device or by different devices.
In some embodiments, the device performing composition may composite only the video and subtitles to be played by one terminal, or may run multiple composition threads to composite separately for different terminals.
In some embodiments, different terminals may play the same subtitles with different video, different subtitles with the same video, or subtitles and video that are both the same or both different. Different application scenarios may call for different display processing.
In some embodiments, the source identifier may be added to the converted subtitles; for example, if one channel of voice data corresponds to the source identifier "user A", then after the text information of that channel is converted into subtitles, the prefix or suffix "user A" may be added to them. As described above, when the source identifier is a person, the person's name may be added to the subtitles; when it is a group of users, the group's name may be added.
In some embodiments, the display parameters of the subtitles may be set according to the source identifier, so that subtitles with different source identifiers are displayed differently. For example, if the source identifiers of two channels of voice data are "user A" and "user B", then after the text information of the two channels is converted into two sets of subtitles, one set may be blue and the other red. The display parameters may include one or more of: color, font, size, and style (such as, but not limited to, bold, italic, underline, and highlight).
When source identifiers are added during display, viewers can see clearly from the subtitles what the user represented by each source identifier has said.
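A per-source style table of the kind described above might look like the following; the colors and fields are illustrative assumptions, since the patent only requires that display parameters may differ by source identifier.

```python
# Illustrative per-source subtitle styling; colors and fields are assumptions.
SUBTITLE_STYLE = {
    "user_a": {"color": "blue", "bold": False},
    "user_b": {"color": "red",  "bold": True},
}
DEFAULT_STYLE = {"color": "white", "bold": False}

def style_for(source_id: str) -> dict:
    # Unknown sources fall back to a neutral default style.
    return SUBTITLE_STYLE.get(source_id, DEFAULT_STYLE)
```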
In some embodiments, converting the text information corresponding to each channel of voice data into subtitles may include:
among the text information corresponding to the channels of voice data, converting into subtitles only the text information corresponding to voice data whose source identifier is a predetermined source identifier.
For example, when subtitles need to be displayed only for the voice data of one or more particular users, the source identifiers corresponding to those users' voice data may be set as the predetermined source identifiers.
In some embodiments, all the text information corresponding to the channels of voice data may instead be converted into subtitles, and when only the subtitles generated from some sources' voice data need to be displayed, a subset of the converted subtitles may be selected for display according to the source identifier.
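Both variants reduce to a simple filter on the source identifier, sketched below; the predetermined set is a hypothetical example.

```python
# Sketch of both filtering variants: convert only predetermined sources'
# text into subtitles, or convert everything and select at display time.
PREDETERMINED_SOURCES = {"user_a"}   # hypothetical: only user A gets subtitles

def select_for_subtitles(items: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # items are (source_id, text) pairs; keep only predetermined sources.
    return [(sid, text) for sid, text in items if sid in PREDETERMINED_SOURCES]
```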
In one embodiment, the method may further include: determining the correspondence between source identifiers and terminals according to the video played by each terminal.
The correspondence may be determined before step S130, so that it is available as preset information. In the correspondence, source identifiers may map to terminals in any of the following ways: many-to-many, many-to-one, one-to-many, or one-to-one.
In some embodiments, setting the correspondence determines which voice data's text information each terminal displays, and equivalently, in which terminals the text information corresponding to voice data from each source is displayed.
In some embodiments, the correspondence may be statically configured; for example, each terminal permanently displays the text information corresponding to one or more fixed channels of voice data. The correspondence may also be dynamically adjustable; for example, it may be modified manually, or updated adaptively according to the source or content of the video played by a terminal.
An example of static configuration: when a terminal plays video data collected for user A, or the terminal's video source is a camera that captures user A, the source identifier corresponding to voice data collected for user A can be associated with that terminal.
An example of adaptively updating the correspondence according to the content of the video played by a terminal: if face recognition finds that the person in the video played by a terminal has changed from user A to user B, the source identifier corresponding to voice data collected from user B can be added to the terminal's source identifiers, and the source identifier corresponding to voice data collected from user A can be deleted. Similarly, if image recognition finds that the video played by a terminal shows the marker of a certain area, the source identifier corresponding to voice data collected in that area can be added to the terminal's source identifiers.
An example of adaptively updating the correspondence according to the source of the video played by a terminal: when a terminal switches its video source from camera X to camera Y, the source identifier corresponding to voice data from the area or object captured by camera Y can be added to the terminal's source identifiers, and the source identifier corresponding to voice data from the area or object captured by camera X can be deleted.
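The two adaptive updates just described amount to editing a terminal's source-identifier set, for example as follows; the mapping from persons or cameras to source identifiers is a hypothetical stand-in.

```python
# Sketch of adaptively updating the correspondence for one terminal. The
# person/camera -> source-identifier lookup is a hypothetical assumption.
def update_on_change(correspondence: dict[str, set[str]], terminal: str,
                     old_source: str, new_source: str) -> None:
    sources = correspondence.setdefault(terminal, set())
    sources.discard(old_source)   # e.g. user A leaves frame / camera X dropped
    sources.add(new_source)       # e.g. user B enters frame / camera Y selected

# Example: face recognition reports user B replacing user A on terminal_1.
corr = {"terminal_1": {"voice_of_user_a"}}
update_on_change(corr, "terminal_1", "voice_of_user_a", "voice_of_user_b")
```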
In this optional embodiment, the correspondence may be stored centrally; for example, a server stores the correspondence, meaning the server knows the source identifiers corresponding to the different terminals (or the terminals corresponding to the different source identifiers). The correspondence may also be stored in a distributed manner; for example, each device stores only the one or more source identifiers corresponding to one terminal, or only the one or more terminals corresponding to one source identifier.
In other optional embodiments, the source identifiers corresponding to a terminal may be unrelated to the video the terminal plays; for example, the correspondence between source identifiers and terminals is unaffected by which video is played, or by whether any video is played at all.
In an optional implementation of this embodiment, determining the source identifier corresponding to each channel of voice data includes:
determining the source identifier of each channel according to the acquisition end that collected it; or determining the source identifier of each channel by performing voice recognition on that channel.
This optional implementation allows the source identifier to be determined quickly without adding substantial extra work.
In this optional implementation, the possible time points for determining the source identifier of voice data include, but are not limited to, one or more of the following: while the voice data is collected, after collection but before voice recognition, and after voice recognition.
In this optional implementation, when the source identifier is determined according to the acquisition end, it may be derived from a hardware identifier of the acquisition end itself. For example, when voice data is collected with an intelligent terminal such as a mobile phone or tablet computer, the source identifier may be an identifier of that terminal, such as, but not limited to, its IMSI (International Mobile Subscriber Identity). As another example, when voice data collected by a microphone is sent through a Bluetooth module, the source identifier may be the identifier of the Bluetooth module. Alternatively, the source identifier may be determined from an identifier of software running on the acquisition end, for example the identifier of the acquisition client.
Similarly, when the source identifier is determined at recognition time, it may be determined from a hardware identifier of the recognition-end device or from the identifier of the running recognition client, or from a user identity obtained by performing voiceprint recognition on the voice data.
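The two determination routes just described might look like this in outline; get_device_id and match_voiceprint are hypothetical hooks rather than APIs named by the patent.

```python
# Sketch of the two ways to determine a source identifier described above.
from typing import Callable

def source_id_from_collector(device) -> str:
    # From the acquisition end's own identity: e.g. an IMSI, a Bluetooth module
    # identifier, or an acquisition-client identifier (hypothetical hook).
    return device.get_device_id()

def source_id_from_voiceprint(voice_data: bytes,
                              match_voiceprint: Callable[[bytes], str]) -> str:
    # From recognition: a user identity obtained by voiceprint matching.
    return match_voiceprint(voice_data)
```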
The identifiers of acquisition clients and recognition clients may be assigned at initialization, or when the acquisition client or recognition client is authenticated.
One recognition client may correspond to a single channel of voice data or to multiple channels.
When one recognition client corresponds to multiple channels and the source identifier is the identifier of the recognition client, those channels share the same source identifier. For example, at a court trial, three channels of voice data collected by three microphones for the plaintiff are all sent to the same recognition client, whose identifier is "plaintiff"; the source identifiers of the three channels are then all "plaintiff".
When each recognition client corresponds to one channel and the source identifier is the identifier of the recognition client, some channels may still share a source identifier. For example, at a court trial, three channels of voice data collected by three microphones for the plaintiff are sent to three recognition clients; if the identifiers of the three recognition clients are all "plaintiff", the source identifiers of the three channels are all "plaintiff".
In other optional embodiments, the source identifier may be selected by the user, or the account with which the user logs in to the acquisition client or recognition client may be used as the source identifier.
Embodiment 2
A speech recognition method, as shown in Fig. 3, includes steps S210 and S220.
S210: performing voice recognition on one or more collected channels of voice data;
S220: determining the text information and source identifier corresponding to each channel of voice data.
In some embodiments, after the text information and source identifier corresponding to each channel are determined, the text information can be displayed, according to a preset correspondence between source identifiers and terminals, in the one or more terminals corresponding to each channel's source identifier; implementation details can be found in embodiment 1.
In some embodiments, steps S210 and S220 may be performed by the recognition-end device shown in Fig. 2.
In some embodiments, steps S210 and S220 may be performed by a device or server running multiple recognition clients.
In some optional embodiments, performing voice recognition on the collected channels of voice data may include:
sending the collected channels of voice data to a cloud for voice recognition, and receiving from the cloud the text information obtained by recognizing each channel.
The cloud may be an independently deployed server or server cluster with voice recognition capability; specifically, it may be a server or server cluster operating as a cloud computing service.
In some optional embodiments, after the text information and source identifier corresponding to each channel are determined, the method may further include:
storing the text information and source identifier corresponding to each channel, in correspondence, in a database.
If the text information corresponding to the channels of voice data is to be displayed, according to a preset correspondence between source identifiers and terminals, in the one or more terminals corresponding to each channel's source identifier, the source identifiers corresponding to different terminals can be determined from the preset correspondence; the text information corresponding to each terminal's source identifiers can then be obtained from the database and displayed in the corresponding terminals. For further details of steps S210 and S220, see embodiment 1.
Embodiment 3
A text information display method, as shown in Fig. 4, includes steps S310 and S320.
S310: obtaining text information corresponding to one or more source identifiers, where the text information corresponding to a source identifier is obtained by performing voice recognition on the voice data corresponding to that source identifier;
S320: displaying the text information corresponding to the one or more source identifiers in the terminals corresponding to those source identifiers, according to a preset correspondence between source identifiers and terminals.
Voice recognition may be performed on the voice data elsewhere, and once the text information and source identifier corresponding to the voice data are determined, they are stored in correspondence. Implementation details of determining the source identifier and text information corresponding to voice data can be found in embodiment 1.
In some embodiments, the method of this embodiment may be performed by the display-end device shown in Fig. 2.
In some embodiments, steps S310 and S320 may be used to display text information for a single terminal; that is, the one or more source identifiers obtained in step S310 correspond to one terminal.
In some embodiments, steps S310 and S320 may be used to display text information for multiple terminals; that is, step S310 obtains the source identifiers corresponding to multiple terminals.
In some embodiments, displaying the text information corresponding to the one or more source identifiers in the corresponding terminals may include:
converting the text information corresponding to the one or more source identifiers into subtitles, and displaying the subtitles in the terminals corresponding to those source identifiers.
Converting into subtitles may specifically mean converting the recognized text information into file information in a subtitle format, so that it can be composited with the video displayed on the terminal.
In some embodiments, obtaining the text information corresponding to one or more source identifiers may include:
obtaining it from a database in which source identifiers and text information are stored in correspondence.
The display side may actively obtain all text information from the database and display the relevant text on the terminal based on the correspondence; it may actively obtain from the database only the text information corresponding to the source identifiers to be displayed, as determined by the correspondence, and display it on the terminal; the database may push all text information to the display side; or the database may push specific text information to different terminals according to the correspondence. Different application scenarios may call for different processing modes.
After voice recognition is performed on voice data and the corresponding text information and source identifier are determined, they are stored in correspondence in the database, so that the corresponding text information can be obtained from the database at display time.
In some embodiments, obtaining the text information corresponding to one or more source identifiers includes:
obtaining the text information corresponding to one or more predetermined source identifiers according to the preset correspondence between source identifiers and terminals.
The predetermined one or more source identifiers may be the source identifiers corresponding, in the correspondence, to one or more predetermined terminals.
The method of this embodiment can be used in conjunction with embodiment 2.
For further implementation details of steps S310 and S320, see embodiment 1.
Embodiment 4
A text information transmission method, in which text information and source identifiers that correspond to each other are stored in advance, the text information corresponding to a source identifier being obtained by performing voice recognition on the voice data corresponding to that source identifier. As shown in Fig. 5, the method includes steps S410 and S420.
S410: determining the source identifiers corresponding to different terminals according to a preset correspondence between source identifiers and terminals;
S420: sending to each terminal the text information corresponding to that terminal's source identifiers.
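In outline, the push of steps S410 and S420 might look like the following; the transport (send_to_terminal) and the in-memory store are hypothetical stand-ins, since the patent does not fix either.

```python
# Sketch of steps S410-S420: per-terminal lookup and push. send_to_terminal is
# a hypothetical transport (e.g. a long-lived connection to the terminal).
from typing import Callable

def push_text(
    correspondence: dict[str, list[str]],        # terminal -> source identifiers
    stored: dict[str, list[str]],                # source identifier -> text items
    send_to_terminal: Callable[[str, str], None],
) -> None:
    for terminal, source_ids in correspondence.items():   # S410
        for sid in source_ids:
            for text in stored.get(sid, []):              # S420
                send_to_terminal(terminal, f"[{sid}] {text}")
```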
Voice recognition may be performed on the voice data elsewhere, and once the text information and source identifier corresponding to the voice data are determined, they are stored in correspondence. Implementation details of determining the source identifier and text information corresponding to voice data can be found in embodiment 1.
After the text information corresponding to each terminal's source identifiers has been sent to the respective terminals, the text information corresponding to multiple channels of voice data from different sources can be displayed in the one or more terminals corresponding to each channel's source identifier according to the preset correspondence; implementation details of the display can be found in embodiment 1. In some embodiments, the server storing the mutually corresponding text information and source identifiers may perform steps S410 and S420, pushing to each terminal the text information corresponding to that terminal's source identifiers.
The server may store the mutually corresponding text information and source identifiers locally, or obtain them from other devices.
The server may store the correspondence between source identifiers and terminals locally, or obtain it from other devices.
Embodiment 5
A text information display system, as shown in Fig. 6, includes:
one or more speech recognition devices 51 and one or more text information display devices 52;
at least one speech recognition device 51 configured to perform voice recognition on one or more collected channels of voice data and determine the text information and source identifier corresponding to each channel;
at least one text information display device 52 configured to display, according to a preset correspondence between source identifiers and terminals, the text information corresponding to the one or more channels of voice data in the terminals corresponding to each channel's source identifier.
The line connecting the speech recognition device 51 and the text information display device 52 in Fig. 6 indicates only that text information passes between them; it does not mean that the two must transfer text information directly, or that they must correspond one-to-one.
In this embodiment, the speech recognition device 51 may perform steps S110 to S120 of embodiment 1 and may serve as the recognition-end device shown in Fig. 2. The text information display device 52 may perform step S130 of embodiment 1 and may serve as the display-end device shown in Fig. 2.
In some embodiments, one speech recognition device 51 may run only one recognition client; in other embodiments, one speech recognition device 51 may run several recognition clients.
In some embodiments, one text information display device 52 may serve a single terminal, displaying in it the text information corresponding to one or more channels of voice data according to that terminal's source identifiers. In other scenarios, one text information display device 52 may serve multiple terminals, displaying the text information corresponding to one or more channels in the corresponding terminals according to the source identifiers of the different terminals.
In some embodiments, at least some of the speech recognition devices 51 and at least some of the text information display devices 52 may share hardware resources.
In some embodiments, the text information display device 52 displaying the text information corresponding to one or more channels of voice data in the terminals corresponding to those channels' source identifiers may include:
converting the text information corresponding to the one or more channels into subtitles, and displaying the subtitles in the terminals corresponding to those channels' source identifiers.
Converting into subtitles may specifically mean converting the recognized text information into file information in a subtitle format, so that it can be composited with the video displayed on the terminal.
In some embodiments, the speech recognition device 51 performing voice recognition on the one or more collected channels of voice data may include:
sending the collected channels of voice data to a cloud for voice recognition, and receiving from the cloud the text information obtained by recognizing each channel.
The cloud may be an independently deployed server or server cluster with voice recognition capability; specifically, it may be a server or server cluster operating as a cloud computing service.
In some embodiments, at least one speech recognition device 51 may be further configured to store the text information and source identifier corresponding to each channel, in correspondence, in a database;
at least one text information display device 52 may be further configured to obtain the text information corresponding to one or more channels of voice data from the database.
The text information display device 52 may actively obtain all text information and the corresponding source identifiers from the database and display the relevant text on the terminal based on the correspondence; it may actively obtain from the database only the text information to be displayed, as determined by the correspondence; the database may push all text information to the text information display device 52, which then performs the operations above; or the database may push specific text information to each text information display device 52 according to the correspondence. Different application scenarios may call for different processing modes.
In some embodiments, the text information display device 52 obtaining the text information corresponding to one or more channels of voice data from the database may include:
obtaining from the database the text information corresponding to one or more predetermined source identifiers, according to the preset correspondence between source identifiers and terminals.
When a text information display device 52 serves one terminal, the predetermined one or more source identifiers may be the source identifiers corresponding, in the correspondence, to that terminal; when it serves multiple terminals, they may be the source identifiers corresponding to those terminals.
Further implementation details of the speech recognition device 51 and the text information display device 52 can be found in the description of the system architecture shown in Fig. 2 in embodiment 1.
Embodiment 6
A speech recognition apparatus comprising: a memory and a processor;
the memory is used for storing a program for voice recognition; the program for speech recognition, when read and executed by the processor, performs the following operations:
carrying out voice recognition on the collected one or more paths of voice data;
and determining the text information and the source identification corresponding to the one or more paths of voice data respectively.
In some embodiments, after determining the text information and the source identifier corresponding to each of the one or more channels of voice data, the text information corresponding to each of the multiple channels of voice data may be respectively displayed in one or more terminals corresponding to the source identifier corresponding to each of the multiple channels of voice data according to a preset correspondence between the source identifier and the terminal; implementation details can be found in example 1.
In some embodiments, the speech recognition apparatus may be used as the recognition end device shown in fig. 2.
In some embodiments, the performing voice recognition on the collected one or more channels of voice data may include:
and sending the collected one or more paths of voice data to a cloud for voice recognition, and receiving text information obtained by performing voice recognition on the one or more paths of voice data from the cloud.
The cloud may be an independently deployed server or a server cluster with voice recognition capability, and specifically may be a server or a server cluster operating in cloud computing.
In some embodiments, the determining the text information and the source identifier corresponding to each of the one or more voice data may further include:
and correspondingly storing the text information and the source identification corresponding to the one or more paths of voice data in a database.
If the text information corresponding to the multiple paths of voice data is required to be respectively displayed in one or more terminals corresponding to the source identifiers corresponding to the multiple paths of voice data according to the preset corresponding relationship between the source identifiers and the terminals, the source identifiers corresponding to different terminals can be specifically determined according to the preset corresponding relationship between the source identifiers and the terminals; and respectively acquiring text information corresponding to the source identifiers corresponding to different terminals from the database, and displaying the text information in the corresponding terminals.
Implementation details of how the apparatus of this embodiment performs speech recognition and determines the text information and source identifiers corresponding to the one or more channels of voice data can be found in embodiment 1.
Example 7
A speech acquisition device comprising: a microphone, a memory, a processor;
the microphone is used for collecting voice data;
the memory is used for storing a program for voice recognition; the program for speech recognition, when read and executed by the processor, performs the following operations:
performing voice recognition on voice data collected by the microphone;
and determining text information and a source identifier corresponding to the voice data.
This embodiment provides a device that integrates collection and recognition, such as a smart phone or tablet computer equipped with a microphone; it can also be regarded as a "smart microphone" with processing capability.
In some embodiments, after the text information and source identifier corresponding to the voice data are determined, the text information may be displayed in the one or more terminals corresponding to that source identifier, according to a preset correspondence between source identifiers and terminals; implementation details can be found in embodiment 1.
In some embodiments, the voice collecting apparatus may be used as a device integrating the voice acquisition device and the recognition end device shown in fig. 2.
In this embodiment, details of performing voice recognition and determining text information and source identifiers corresponding to the voice data may be referred to in embodiment 1.
Example 8
A text information processing apparatus comprising: a memory, a processor;
the memory is used for storing a program for displaying text information; the program for text information display, when read and executed by the processor, performs the following operations:
acquiring text information corresponding to one or more source identifiers; the text information corresponding to the source identification is obtained by carrying out voice recognition on the voice data corresponding to the source identification;
and displaying the text information corresponding to the one or more source identifiers in the terminal corresponding to the one or more source identifiers according to the preset corresponding relationship between the source identifiers and the terminal.
Other devices (such as, but not limited to, the apparatus of embodiment 6 or 7) may perform voice recognition on the voice data, determine the text information and source identifier corresponding to the voice data, and store them correspondingly. Implementation details of determining the source identifier and text information corresponding to the voice data can be found in embodiment 1.
In some embodiments, the text information processing apparatus may be used as the display device shown in fig. 2.
In some embodiments, the displaying the text information corresponding to the one or more source identifiers in the terminal corresponding to the one or more source identifiers may include:
and respectively converting the text information corresponding to the one or more source identifiers into subtitles, and displaying the subtitles in the terminals corresponding to the one or more source identifiers.
Converting to subtitles may specifically mean converting the text information obtained by voice recognition into file information in a subtitle format, so that it can be composited with the video displayed on the terminal.
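For instance, if SubRip (SRT) is chosen as the subtitle format, a recognized text segment might be rendered as below; the timing values and the bracketed source label are illustrative choices, not requirements of the embodiment.

```python
def to_srt_entry(index: int, start_s: float, end_s: float, text: str) -> str:
    """Render one recognized text segment as a SubRip (SRT) entry."""
    def fmt(seconds: float) -> str:
        # SRT timestamps look like HH:MM:SS,mmm
        total = int(seconds)
        ms = int((seconds - total) * 1000)
        h, rem = divmod(total, 3600)
        m, s = divmod(rem, 60)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"
    return f"{index}\n{fmt(start_s)} --> {fmt(end_s)}\n{text}\n"

# Example: a segment attributed to source identifier "speaker-A".
print(to_srt_entry(1, 12.0, 15.5, "[speaker-A] Good morning, everyone."))
```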
In some embodiments, the obtaining text information corresponding to one or more source identifiers may include:
and acquiring the text information corresponding to one or more source identifiers from a database correspondingly storing the source identifiers and the text information.
After voice recognition is performed on the voice data and the corresponding text information and source identifier are determined, they are stored correspondingly in the database so that the corresponding text information can be obtained from the database at display time.
Several processing modes are possible depending on the application scenario: the terminal side may actively obtain all text information from the database and display it according to the correspondence; the terminal side may actively obtain from the database only the text information of the source identifiers to be displayed, according to the correspondence, and display it; the database may actively push all text information to be displayed; or the database may push specific text information to the different terminals according to the correspondence.
In some embodiments, the obtaining text information corresponding to one or more source identifiers may include:
and acquiring text information corresponding to one or more preset source identifiers according to the corresponding relation between the preset source identifiers and the terminal.
The predetermined one or more source identifiers may be the source identifiers corresponding to one or more predetermined terminals in the correspondence. In this embodiment, implementation details of displaying, according to a preset correspondence between source identifiers and terminals, the text information corresponding to the one or more source identifiers in the corresponding terminals can be found in embodiment 1.
Example 9
A text information display apparatus comprising: the display screen, the memory and the processor;
the memory is used for storing a program for displaying text information; the program for text information display, when read and executed by the processor, performs the following operations:
acquiring text information corresponding to a preset source identifier; the text information corresponding to the source identification is obtained by carrying out voice recognition on the voice data corresponding to the source identification;
and displaying the acquired text information in the display screen.
This embodiment provides a display apparatus that can display, in a video and according to a source identifier, the text information corresponding to the associated voice data; the predetermined source identifier is the source identifier corresponding to this display apparatus. The display apparatus may be, for example, a smart phone or tablet computer, and can be regarded as a "smart display screen" with processing capability.
Other devices (such as, but not limited to, the apparatus of embodiment 6 or 7) may perform voice recognition on the voice data, determine the text information and source identifier corresponding to the voice data, and store them correspondingly. Implementation details of determining the source identifier and text information corresponding to the voice data can be found in embodiment 1.
In some embodiments, the text information display apparatus may be a device that integrates the display device and the terminal shown in fig. 2.
In some embodiments, the memory in the text information display apparatus may be further configured to store a program for performing subtitle generation and composition; the displaying the acquired text information in the display screen may include:
the acquired text information is converted into subtitles and displayed together with video data received by the apparatus.
Converting to subtitles may specifically mean converting the text information obtained by voice recognition into file information in a subtitle format, so that it can be composited with the video displayed on the terminal.
In this embodiment, implementation details of displaying, according to a preset correspondence between source identifiers and terminals, the text information corresponding to the one or more source identifiers in the corresponding terminals can be found in embodiment 1.
Example 10
A processing apparatus comprising:
a first device, which is either the speech recognition apparatus of embodiment 6 or the speech acquisition apparatus of embodiment 7;
a second device, which is either the text information processing apparatus of embodiment 8 or the text information display apparatus of embodiment 9.
The first device and the second device may share the memory and/or the processor.
The processing device may be a smart phone or tablet computer equipped with a microphone and a display screen, and may further include communication modules such as a network card or a Bluetooth module. In one application scenario, the processing device collects a user's voice data through the microphone, recognizes it as text information, and sends the text information to a database; it also obtains the text information of predetermined source identifiers from the database and displays it on its own screen, either directly or after conversion into subtitles. In another application scenario, the processing device may use an external microphone and display screen; the other operations are the same as in the scenario above.
The details of the first device and the second device in this embodiment can be found in embodiments 6 to 9.
Example 11
A text information display system, as shown in fig. 7, in which a correspondence relationship between a source identifier and a terminal is preset, the system comprising:
the voice recognition module 71 is configured to perform voice recognition on the collected multiple channels of voice data from different sources;
a determining module 72, configured to determine the text information and source identifier corresponding to each of the multiple channels of voice data;
and a display module 73, configured to display, according to the correspondence, the text information corresponding to each of the multiple channels of voice data in the one or more terminals corresponding to each channel's source identifier.
In this embodiment, the speech recognition module 71 is a part of the system responsible for speech recognition, and may be software, hardware, or a combination of the two.
In this embodiment, the determining module 72 is a part of the system responsible for determining the text information and the source identifier corresponding to the voice data, and may be software, hardware, or a combination of the two.
In this embodiment, the display module 73 is a part of the system that is responsible for displaying the text information corresponding to the voice data in the terminal, and may be software, hardware, or a combination of the two.
In some embodiments, the display module 73 displaying the text information corresponding to each of the multiple channels of voice data in the one or more terminals corresponding to each channel's source identifier may include:
converting the text information corresponding to each of the multiple channels of voice data into subtitles, and displaying the subtitles in the one or more terminals corresponding to each channel's source identifier.
In some embodiments, converting the text information corresponding to each of the multiple channels of voice data into subtitles may include:
converting, among the text information corresponding to the multiple channels of voice data, the text information corresponding to the voice data of one or more predetermined source identifiers into subtitles.
Converting to subtitles may specifically mean converting the text information obtained by voice recognition into file information in a subtitle format, so that it can be composited with the video displayed on the terminal.
In some embodiments, the system may further comprise:
and the setting module is used for determining the corresponding relation between the source identifier and the terminal according to the video played by the terminal.
In some embodiments, the determining module 72 determining the source identifiers corresponding to the multiple channels of voice data may include:
determining the source identifier of each channel according to the acquisition end of that channel of voice data; or determining the source identifier of each channel by performing voice recognition on that channel of voice data.
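The two strategies can be sketched as follows; the mapping values are illustrative, and identify_speaker is a hypothetical voiceprint-matching helper, not a concrete library call.

```python
# Strategy 1: a fixed mapping from acquisition ends to source
# identifiers (example values only).
CAPTURE_END_TO_SOURCE = {"mic-01": "speaker-A", "mic-02": "speaker-B"}

def source_by_capture_end(capture_end_id: str) -> str:
    """The acquisition end that produced the channel determines the
    source identifier directly."""
    return CAPTURE_END_TO_SOURCE[capture_end_id]

def source_by_recognition(pcm_bytes: bytes) -> str:
    """Derive the source identifier from the voice data itself,
    e.g. by speaker identification (voiceprint matching)."""
    return identify_speaker(pcm_bytes)  # hypothetical speaker-ID helper
```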
In some embodiments, the voice recognition module 71 performing voice recognition on the collected multiple channels of voice data from different sources may include:
sending the collected multiple channels of voice data from different sources to a cloud for voice recognition, and receiving from the cloud the text information obtained by performing voice recognition on the multiple channels of voice data.
The cloud may be an independently deployed server or server cluster with voice recognition capability, and in particular may be a server or server cluster running on a cloud computing platform.
In some embodiments, after the determining module 72 determines the text information and source identifier corresponding to each of the multiple channels of voice data, the system may further:
store the text information and source identifier corresponding to each of the multiple channels of voice data correspondingly in a database;
the display module 73 displaying, according to the correspondence, the text information corresponding to each of the multiple channels of voice data in the one or more terminals corresponding to each channel's source identifier may then include:
determining the source identifiers corresponding to the different terminals according to the correspondence;
and obtaining, from the database, the text information corresponding to the source identifiers of the different terminals, and displaying it in the corresponding terminals.
In this embodiment, the modules correspond to steps S110 to S130 in embodiment 1; implementation details of each module can be found in the corresponding steps of embodiment 1.
Example 12
A speech recognition apparatus comprising:
the recognition module is used for performing voice recognition on one or more channels of collected voice data;
and the text information and source identifier determining module is used for determining the text information and source identifier corresponding to each of the one or more channels of voice data.
In some embodiments, after the text information and source identifier corresponding to each of the one or more channels of voice data are determined, the text information corresponding to each channel may be displayed in the one or more terminals corresponding to that channel's source identifier, according to a preset correspondence between source identifiers and terminals; implementation details can be found in embodiment 1.
In some embodiments, the speech recognition apparatus of this embodiment may be used as the recognition end device shown in fig. 2.
In this embodiment, the recognition module is a part of the apparatus responsible for voice recognition, and may be software, hardware, or a combination of the two.
In this embodiment, the text information and source identifier determining module is a part of the device responsible for determining the text information and the source identifier corresponding to the voice data, and may be software, hardware, or a combination of the two.
In some embodiments, the recognition module performing voice recognition on the one or more channels of collected voice data may include:
sending the collected one or more channels of voice data to a cloud for voice recognition, and receiving from the cloud the text information obtained by performing voice recognition on the one or more channels of voice data.
The cloud may be an independently deployed server or server cluster with voice recognition capability, and in particular may be a server or server cluster running on a cloud computing platform.
In some embodiments, after the text information and source identifier determining module determines the text information and source identifier corresponding to each of the one or more channels of voice data, the operations may further include:
storing the text information and source identifier corresponding to each of the one or more channels of voice data correspondingly in a database.
If the text information corresponding to each of multiple channels of voice data needs to be displayed in the one or more terminals corresponding to each channel's source identifier according to the preset correspondence between source identifiers and terminals, the source identifiers corresponding to the different terminals may be determined according to that correspondence; the text information corresponding to the source identifiers of the different terminals may then be obtained from the database and displayed in the corresponding terminals.
In this embodiment, the modules correspond to steps S210 to S220 in embodiment 2; implementation details of each module can be found in embodiment 2.
Example 13
A text information display apparatus comprising:
the acquisition module is used for acquiring text information corresponding to one or more source identifiers; the text information corresponding to the source identification is obtained by carrying out voice recognition on the voice data corresponding to the source identification;
and the text information display module is used for displaying the text information corresponding to the one or more source identifications in the terminal corresponding to the one or more source identifications according to the preset corresponding relation between the source identifications and the terminal.
Voice recognition may be performed on the voice data by other devices (such as, but not limited to, the apparatus of embodiment 12), which determine the text information and source identifier corresponding to the voice data and store them correspondingly. Implementation details of determining the source identifier and text information corresponding to the voice data can be found in embodiment 1.
In some embodiments, the apparatus of this embodiment may be used as the display device shown in fig. 2.
In this embodiment, the obtaining module is the part of the apparatus responsible for obtaining the text information corresponding to the source identifiers, and may be software, hardware, or a combination of the two.
In this embodiment, the text information display module is the part of the apparatus responsible for displaying the text information corresponding to the voice data in the terminal, and may be software, hardware, or a combination of the two.
In some embodiments, the text information display module displaying the text information corresponding to the one or more source identifiers in the terminals corresponding to the one or more source identifiers may include:
and respectively converting the text information corresponding to the one or more source identifiers into subtitles, and displaying the subtitles in the terminals corresponding to the one or more source identifiers.
Converting to subtitles may specifically mean converting the text information obtained by voice recognition into file information in a subtitle format, so that it can be composited with the video displayed on the terminal.
In some embodiments, the obtaining module obtaining text information corresponding to one or more source identifiers may include:
and acquiring the text information corresponding to one or more source identifiers from a database correspondingly storing the source identifiers and the text information.
In some embodiments, the obtaining module obtaining text information corresponding to one or more source identifiers may include:
obtaining text information corresponding to one or more predetermined source identifiers according to the preset correspondence between source identifiers and terminals.
The predetermined one or more source identifiers may be the source identifiers corresponding to one or more predetermined terminals in the correspondence.
In this embodiment, the modules correspond to steps S310 to S320 in embodiment 3; implementation details of each module can be found in embodiment 3.
Example 14
A text information transmission device pre-stores text information and source identifiers that correspond to each other, wherein the text information corresponding to a source identifier is obtained by performing voice recognition on the voice data corresponding to that source identifier; the device comprises:
a source identifier determining module, used for determining the source identifiers corresponding to different terminals according to the preset correspondence between source identifiers and terminals;
and a sending module, used for sending to each terminal the text information corresponding to that terminal's source identifiers.
Voice recognition may be performed on the voice data elsewhere; after the text information and source identifier corresponding to the voice data are determined, they are stored correspondingly. Implementation details of determining the source identifier and text information corresponding to the voice data can be found in embodiment 1.
After the text information corresponding to each terminal's source identifiers is sent to the different terminals, the text information corresponding to multiple channels of voice data from different sources can be displayed in the one or more terminals corresponding to each channel's source identifier, according to the preset correspondence between source identifiers and terminals. Implementation details of the display can be found in embodiment 1.
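A minimal sketch of this per-terminal dispatch, assuming a terminal-to-source-identifier correspondence like the earlier examples; fetch_texts and send_to_terminal are hypothetical stand-ins for the database read and the transport to the terminal.

```python
# Example correspondence between terminals and source identifiers.
CORRESPONDENCE = {
    "screen-main": ["speaker-A", "speaker-B"],
    "screen-judge": ["witness"],
}

def dispatch_all() -> None:
    """For each terminal, look up its source identifiers and send it
    the stored text information for exactly those identifiers."""
    for terminal_id, source_ids in CORRESPONDENCE.items():
        for source_id in source_ids:
            for text in fetch_texts(source_id):      # hypothetical database read
                send_to_terminal(terminal_id, text)  # hypothetical transport call
```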
In this embodiment, the source identifier determining module is a part of the apparatus responsible for determining the source identifier, and may be software, hardware, or a combination of the two.
In this embodiment, the sending module is a part of the apparatus responsible for sending the text message corresponding to the source identifier, and may be software, hardware, or a combination of the two.
In this embodiment, each module corresponds to steps S410 to S420 in embodiment 4, and details of implementation of each module can be found in embodiment 4.
Example 15
A video subtitling system provides two major functions. First, it collects live audio input and completes real-time voice recognition from voice data to text information through a cloud voice recognition service; during recognition, the source identifier of the voice data and the corresponding text information are stored correspondingly in a recognition result database, which amounts to implementing the functions of the voice acquisition device and the recognition end device shown in fig. 2. Second, it obtains text information from the recognition result database in real time, generates subtitles, composites them with video data collected at the live broadcast site, and broadcasts the result as the final live image, which amounts to implementing the functions of the display end device and the terminal shown in fig. 2.
The system of this embodiment can be applied to adding subtitles to live video of lectures, conferences, and similar scenarios.
The system, as shown in fig. 8, comprises:
the audio input hardware (i.e., the acquisition end) performs audio capture; in this embodiment it may include a microphone, an audio transmission line, and a sound card, and its function is to collect the on-site voice input and convert it into digital voice data;
the recognition client is responsible for determining the text information and source identifier corresponding to the voice data and storing them correspondingly in the recognition result database;
the cloud voice recognition service is responsible for receiving the voice data sent by the recognition client, converting it into text information, and sending the text information back to the recognition client;
the recognition result database stores text information and source identifiers correspondingly, and serves as the information exchange intermediary between the recognition clients and the subtitle generating clients;
the subtitle generating client is responsible for determining the source identifiers corresponding to its connected terminal according to the correspondence between source identifiers and terminals, obtaining from the recognition result database the text information corresponding to the determined source identifiers, and converting the obtained text information into a pure subtitle image, such as text subtitles on a pure green background;
the live image acquisition module serves as the video source and is responsible for collecting live video data;
and the composite image module superimposes the subtitle image generated by the subtitle generating client onto the images in the video data returned by the live image acquisition module, producing the final live video with subtitles overlaid, for playback on terminals such as live streams or projection screens.
In this embodiment, recognition clients and audio input hardware are in one-to-one correspondence, and the system may include multiple "audio input hardware + recognition client" combinations, for example the N groups in fig. 9. Each combination corresponds to the voice data of one user or one set of users, and the role information labeled on the recognition client in the combination serves as the source identifier of that combination's voice data. The text information output by each combination is thus tied to the source identifier of its voice data, so the speech of different users or user sets can be distinguished in the text information.
In this embodiment, the subtitle generating client, the live image acquisition module, and the composite image module are in one-to-one correspondence, and the system may include multiple "subtitle generating client + live image acquisition module + composite image module" combinations, for example the M groups in fig. 9. Each combination represents one display requirement, and each display requirement can display a different "subtitle + video" combination; one display requirement may correspond to one or more terminals. For example, a large screen at a conference site may display the video of the user currently speaking, with subtitles converted from the text information corresponding to that user's voice data; at the same time, the event may be live-streamed online with pictures of both the speaking user and the live host, plus subtitles converted from the text information corresponding to the voice data of both.
In this embodiment, the video subtitling process is as follows (a minimal code sketch follows the steps):
after the system is started, the recognition client obtains the voice data collected by the audio input hardware and transmits it in real time over the network to the cloud voice recognition service, completing the voice recognition function and obtaining the text information returned by the cloud server as the recognition result;
after obtaining the text information corresponding to the voice data, the recognition client sends the text information and the source identifier over the network to the recognition result database for corresponding storage; the source identifier identifies the recognition client, and in this embodiment is the user role information labeled on the recognition client, such as "speaker A" or "plaintiff";
the subtitle generating client obtains, from the recognition result database, the text information corresponding to one or more predetermined source identifiers and converts it into subtitles;
the composite image module composites the subtitles produced by the subtitle generating client with the video data collected by the live image acquisition module, and sends the result to the corresponding terminal for playback.
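The four steps above can be read as two cooperating loops. The sketch below is a minimal illustration of that division of labor; every helper it calls (capture_audio, cloud_recognize, store_result, fetch_new_texts, render_subtitle_image, overlay_and_play) is a hypothetical stand-in for the corresponding component in fig. 8, not a real API.

```python
import time

SOURCE_ID = "speaker-A"  # role information labeled on this recognition client

def recognition_client_loop():
    """Steps 1-2: capture audio, recognize it in the cloud, and store
    the text with this client's source identifier."""
    while True:
        pcm = capture_audio()           # audio input hardware
        text = cloud_recognize(pcm)     # cloud voice recognition service
        store_result(SOURCE_ID, text)   # recognition result database

def subtitle_client_loop(terminal_source_ids):
    """Steps 3-4: poll the database for the terminal's source
    identifiers, render subtitles, and composite them with video."""
    while True:
        for source_id in terminal_source_ids:
            for text in fetch_new_texts(source_id):  # poll the database
                image = render_subtitle_image(text)  # pure subtitle image
                overlay_and_play(image)              # composite image module
        time.sleep(0.2)  # simple polling interval
```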
One application scenario of this embodiment is a court trial. The courtroom is divided into four areas: the defendant's seat, the plaintiff's seat, the bench, and the witness stand. Each area is equipped with a microphone as audio input hardware and a camera as a live image acquisition module; an additional camera capable of capturing the whole courtroom may also be installed. Several display screens can serve as terminals on site, such as a large screen visible to everyone, a small screen dedicated to the judge, and a screen for live broadcasting outside the court.
The number of microphones in each area may be one or more. When multiple microphones in one area are assigned to different persons, the source identifiers of the voice data they collect may be the same or different. For example, the source identifiers for all of the plaintiff area's microphones may be set to "plaintiff side", or each microphone may have its own source identifier, such as "plaintiff" and "plaintiff's attorney". The configuration sketch below illustrates the two choices.
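As an illustration of the two labeling choices just described, a configuration might map microphones to source identifiers in either of these ways (all names and values are examples only):

```python
# One shared identifier for the whole area:
SHARED = {
    "plaintiff-mic-1": "plaintiff side",
    "plaintiff-mic-2": "plaintiff side",
}

# Or one identifier per person:
PER_PERSON = {
    "plaintiff-mic-1": "plaintiff",
    "plaintiff-mic-2": "plaintiff's attorney",
}
```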
Each area may also have multiple cameras, each photographing a different person. A camera may be connected directly to the composite image module, or the composite image module may receive its video data via a database or other intermediary.
Each composite image module can take one fixed camera among the four areas as its video source, or use multiple cameras as video sources, selecting the video data shot by one of them for subtitle compositing during playback and switching to another camera on instruction or as needed.
The video data played on a display screen is the video data provided by its corresponding composite image module. The correspondence between a display screen and source identifiers can be determined or modified according to the video data that the display screen plays. For example, a display screen playing the picture that captures the whole courtroom shows subtitles converted from the text information of voice data collected from the different users in the courtroom, while a display screen playing the picture shot in a particular area shows subtitles converted from the text information of voice data collected from the users in that area.
When subtitles converted from the text information of different users' voice data are played on a display screen, each subtitle may include its source identifier; alternatively, the subtitles of different users may be distinguished by different display parameters.
This application scenario includes two typical implementations, described by taking as an example the case where the defendant's seat, the plaintiff's seat, the bench, and the witness stand each have one microphone, one camera, and one display screen.
In one implementation, shown in fig. 9, the microphone, camera, and display screen of each area are all connected to one processing device in that area (they may also be connected to different processing devices in the area); each processing device exchanges data with the server. The server runs a recognition client and a subtitle generating client for each processing device, together with a compositing thread that composites subtitles with video data. The server maintains the recognition result database (or another device maintains it and exchanges data with the server) and obtains and stores the text information corresponding to the voice data through the cloud voice recognition service. The video data shot by each camera is sent to the server through the connected processing device and supplied by the server to the corresponding compositing thread.
In another implementation, shown in fig. 10, the microphone, camera, and display screen of each area are likewise all connected to one processing device in that area (or to different processing devices in the area). The recognition client runs on the processing device connected to the microphone, obtains the text information corresponding to the voice data through the cloud voice recognition service, and stores it in the recognition result database. The subtitle generating client and the composite image module run on the processing device connected to the display screen. Each processing device exchanges data with the recognition result database on the server. The video data shot by a camera is sent, directly or relayed through its connected processing device or other equipment, to the processing device connected to the display screen that needs to play it.
In the implementation of fig. 9, speech recognition, subtitle generation, and compositing are all performed on the server, and the processing device in each area is only responsible for data transmission. The correspondence between source identifiers and terminals is maintained by the server; the server can monitor the global data flow and, when necessary, update the correspondence according to a predetermined rule or according to an instruction. By updating the correspondence on the server, the display position of subtitles across the whole system can be adjusted in one place.
In the implementation of fig. 10, speech recognition, subtitle generation, and compositing are all performed by the processing devices in each area, and the server only maintains the recognition result database. The correspondence between source identifiers and terminals must be maintained by the processing devices in each area, each of which holds only the part of the correspondence relevant to it, such as the source identifiers of the display screen connected to it. A processing device may update its correspondence autonomously or according to instructions, but changing the global display policy requires updating the correspondence in multiple processing devices separately.
Of course, implementations other than those shown in figs. 9 and 10 may be adopted; for example, one or more of speech recognition, subtitle generation, and compositing may be performed in the processing devices while the remaining operations are performed on the server.
In this embodiment, the recognition client implements the functions of the recognition end device shown in fig. 2, and the subtitle generating client and the composite image module (or the subtitle generating client, the composite image module, and the live image acquisition module) together implement the functions of the display end device shown in fig. 2; implementation details can be found in embodiment 1.
It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program stored in a computer-readable storage medium, such as a read-only memory, magnetic disk, or optical disk. All or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in hardware or as a software functional module. The present application is not limited to any specific combination of hardware and software.
There are, of course, many other embodiments of the invention, and those skilled in the art can make various changes and modifications without departing from its spirit and scope.

Claims (24)

1. A text information display method is preset with a corresponding relation between a source identifier and a terminal, and comprises the following steps:
respectively carrying out voice recognition on the collected multi-channel voice data from different sources;
determining text information and a source identifier corresponding to each of the multiple paths of voice data, including:
respectively determining text information and source identification corresponding to the multiple paths of voice data by performing voice recognition on the multiple paths of voice data;
according to the corresponding relation, respectively displaying the text information corresponding to the multiple paths of voice data in one or more terminals corresponding to the source identifiers corresponding to the multiple paths of voice data;
and the corresponding relation can be updated in a self-adaptive manner according to the source or the content of the video played by the terminal.
2. The method of claim 1, wherein the displaying the text information corresponding to the multiple paths of voice data in the one or more terminals corresponding to the source identifiers corresponding to the multiple paths of voice data respectively comprises:
and respectively converting the text information corresponding to the multi-channel voice data into subtitles, and displaying the subtitles in one or more terminals corresponding to the source identifiers corresponding to the multi-channel voice data.
3. The method of claim 2, wherein the converting the text information corresponding to each of the multiple paths of voice data into subtitles comprises:
and in the text information corresponding to each of the multiple paths of voice data, converting the text information corresponding to the voice data corresponding to the preset source identification into a subtitle.
4. The method of claim 1, further comprising:
and determining the corresponding relation between the source identification and the terminal according to the video played by the terminal.
5. The method of claim 1, wherein determining the source identifiers corresponding to the multiple paths of voice data further comprises:
and respectively determining the source identifiers corresponding to the multiple paths of voice data according to the acquisition ends of the multiple paths of voice data.
6. The method according to any one of claims 1 to 5, wherein the performing speech recognition on the collected multiple paths of speech data from different sources respectively comprises:
the collected multi-path voice data from different sources are sent to a cloud for voice recognition, and text information obtained by voice recognition of the multi-path voice data is received from the cloud.
7. The method according to any one of claims 1 to 5, wherein the determining the text information and the source identifier corresponding to each of the multiple paths of voice data further comprises:
correspondingly storing the text information and the source identification corresponding to the multi-channel voice data in a database;
the displaying, according to the correspondence, the text information corresponding to each of the multiple paths of voice data in one or more of the terminals corresponding to the source identifiers corresponding to each of the multiple paths of voice data includes:
determining source identifiers corresponding to different terminals according to the corresponding relation;
and respectively acquiring text information corresponding to the source identifiers corresponding to different terminals from the database, and displaying the text information in the corresponding terminals.
8. A speech recognition method comprising:
carrying out voice recognition on the collected one or more paths of voice data;
respectively determining text information and source identification corresponding to the one or more paths of voice data by performing voice recognition on the one or more paths of voice data;
the text information is used for being displayed in one or more terminals corresponding to the source identifiers of the corresponding voice data, and the corresponding relation between the source identifiers and the terminals is preset and can be updated in a self-adaptive mode according to the source or the content of the video played by the terminals.
9. The method of claim 8, wherein performing speech recognition on the one or more collected channels of speech data comprises:
and sending the collected one or more paths of voice data to a cloud for voice recognition, and receiving text information obtained by performing voice recognition on the one or more paths of voice data from the cloud.
10. The method of claim 8, wherein determining the text information and the source identifier corresponding to each of the one or more voice data further comprises:
and correspondingly storing the text information and the source identification corresponding to the one or more paths of voice data in a database.
11. A text information display method comprising:
acquiring text information corresponding to one or more source identifiers; the text information corresponding to the source identification is obtained by carrying out voice recognition on the voice data corresponding to the source identification;
displaying the text information corresponding to the one or more source identifiers in the terminal corresponding to the one or more source identifiers according to the corresponding relation between the preset source identifiers and the terminal;
and the corresponding relation can be updated in a self-adaptive manner according to the source or the content of the video played by the terminal.
12. The method of claim 11, wherein the displaying the text information corresponding to the one or more source identifiers in the terminal corresponding to the one or more source identifiers comprises:
and respectively converting the text information corresponding to the one or more source identifiers into subtitles, and displaying the subtitles in the terminals corresponding to the one or more source identifiers.
13. The method of claim 11, wherein obtaining text information corresponding to one or more source identifiers comprises:
and acquiring the text information corresponding to one or more source identifiers from a database correspondingly storing the source identifiers and the text information.
14. The method of claim 11, wherein obtaining text information corresponding to one or more source identifiers comprises:
and acquiring text information corresponding to one or more preset source identifiers according to the corresponding relation between the preset source identifiers and the terminal.
15. A text information transmission method is characterized in that text information and source identification which correspond to each other are pre-stored, wherein the text information corresponding to the source identification is obtained by carrying out voice recognition on voice data corresponding to the source identification; the method comprises the following steps:
respectively determining source identifiers corresponding to different terminals according to the preset corresponding relationship between the source identifiers and the terminals;
respectively sending text information corresponding to the source identification corresponding to the terminal to different terminals;
and the corresponding relation can be updated in a self-adaptive manner according to the source or the content of the video played by the terminal.
16. A text information display system, comprising:
one or more speech recognition devices, one or more text information display devices;
at least one voice recognition device is used for performing voice recognition on one or more paths of collected voice data and determining the text information and source identification corresponding to the one or more paths of voice data respectively, including: respectively determining text information and source identification corresponding to the one or more paths of voice data by performing voice recognition on the one or more paths of voice data;
the at least one text information display device is used for displaying text information corresponding to one or more paths of voice data in the terminal corresponding to the source identifier corresponding to the one or more paths of voice data according to the corresponding relation between the preset source identifier and the terminal;
and the corresponding relation can be updated in a self-adaptive manner according to the source or the content of the video played by the terminal.
17. The system of claim 16, wherein the displaying the text information corresponding to the one or more paths of voice data in the terminal corresponding to the source identifier corresponding to the one or more paths of voice data comprises:
and converting text information corresponding to one or more paths of voice data into subtitles, and displaying the subtitles in a terminal corresponding to a source identifier corresponding to the one or more paths of voice data.
18. The system of claim 16, wherein the voice recognition device performing voice recognition on the collected one or more paths of voice data comprises:
and sending one or more paths of collected voice data to a cloud for voice recognition, and receiving text information obtained by performing voice recognition on the one or more paths of voice data from the cloud.
19. The system of claim 16, wherein:
the at least one voice recognition device is further used for correspondingly storing the text information and the source identification corresponding to the one or more paths of voice data in a database;
and at least one text information display device is also used for acquiring text information corresponding to one or more paths of voice data from the database.
20. The system of claim 19, wherein the text information display device obtaining text information corresponding to one or more paths of voice data from the database comprises:
and acquiring text information corresponding to one or more preset source identifiers from the database according to the corresponding relation between the preset source identifiers and the terminal.
21. A speech recognition apparatus comprising: a memory and a processor;
the method is characterized in that:
the memory is used for storing a program for voice recognition; the program for speech recognition, when read and executed by the processor, performs the following operations:
carrying out voice recognition on the collected one or more paths of voice data;
respectively determining text information and source identification corresponding to the one or more paths of voice data by performing voice recognition on the one or more paths of voice data;
the text information is used for being displayed in one or more terminals corresponding to the source identifiers of the corresponding voice data, and the corresponding relation between the source identifiers and the terminals is preset and can be updated in a self-adaptive mode according to the source or the content of the video played by the terminals.
22. A speech acquisition device comprising: a microphone, a memory, a processor;
the microphone is used for collecting voice data;
the method is characterized in that:
the memory is used for storing a program for voice recognition; the program for speech recognition, when read and executed by the processor, performs the following operations:
performing voice recognition on voice data collected by the microphone;
determining, by performing voice recognition on the voice data collected by the microphone, text information and a source identifier corresponding to the voice data;
the text information is used for being displayed in one or more terminals corresponding to the source identifiers of the corresponding voice data, and the corresponding relation between the source identifiers and the terminals is preset and can be updated in a self-adaptive mode according to the source or the content of the video played by the terminals.
23. A text information processing apparatus comprising: a memory, a processor;
the method is characterized in that:
the memory is used for storing a program for displaying text information; the program for text information display, when read and executed by the processor, performs the following operations:
acquiring text information corresponding to one or more source identifiers; the text information corresponding to the source identification is obtained by carrying out voice recognition on the voice data corresponding to the source identification;
displaying the text information corresponding to the one or more source identifiers in the terminal corresponding to the one or more source identifiers according to the corresponding relation between the preset source identifiers and the terminal;
and the corresponding relation can be updated in a self-adaptive manner according to the source or the content of the video played by the terminal.
24. A text information display apparatus comprising: the display screen, the memory and the processor;
the method is characterized in that:
the memory is used for storing a program for displaying text information; the program for text information display, when read and executed by the processor, performs the following operations:
acquiring text information corresponding to a preset source identifier; the text information corresponding to the source identification is obtained by carrying out voice recognition on the voice data corresponding to the source identification;
displaying the acquired text information in the display screen according to the corresponding relation between the source identification and the terminal; the corresponding relation can be updated in a self-adaptive mode according to the source or the content of the video played by the terminal.
CN201610523339.9A 2016-07-05 2016-07-05 Text information display method, device and system, and voice recognition method and device Active CN107578777B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610523339.9A CN107578777B (en) 2016-07-05 2016-07-05 Text information display method, device and system, and voice recognition method and device

Publications (2)

Publication Number Publication Date
CN107578777A CN107578777A (en) 2018-01-12
CN107578777B true CN107578777B (en) 2021-08-03

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 1249271)
GR01 Patent grant