CN115134465A - Text display method and device, storage medium and electronic device

Info

Publication number
CN115134465A
CN115134465A (application CN202210589139.9A)
Authority
CN
China
Prior art keywords
audio
text
equipment
target
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210589139.9A
Other languages
Chinese (zh)
Inventor
刘莉
王立国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd and Haier Smart Home Co Ltd
Priority to CN202210589139.9A
Publication of CN115134465A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/487 Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493 Interactive information services, e.g. directory enquiries; arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H04M3/4936 Speech interaction details
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a text display method and device, a storage medium and an electronic device in the technical field of smart homes. The text display method includes: determining a first text input by a first object through a first device, and performing speech synthesis on the first text to obtain first audio; sending the first audio to a second device and acquiring second audio collected by the second device, where the second audio is the audio uttered by a second object after receiving the first audio through the second device; and performing speech recognition on the second audio, and displaying, through the first device, a second text obtained after the speech recognition. This technical solution solves the problem that two parties cannot hold a normal voice call because of factors such as a noisy environment.

Description

Text display method and device, storage medium and electronic device
Technical Field
The application relates to the technical field of smart homes, and in particular to a text display method and device, a storage medium and an electronic device.
Background
With the rapid development of society, voice calls have become increasingly convenient: the two parties of a call can each place the call through a corresponding device. In some scenarios, however, one party may be unable to hold a real-time voice call because of factors such as a noisy environment, yet still needs to communicate with the other party in real time. For example, commuters on public transport, mothers caring for infants, or users with speech impairments may be in noisy surroundings or otherwise unable to speak, but still need to reach family or friends by voice.
No effective solution has yet been proposed in the related art for the problem that two parties cannot hold a normal voice call because of factors such as a noisy environment.
Accordingly, the related art needs improvement to overcome this shortcoming.
Disclosure of Invention
The embodiments of the invention provide a text display method and device, a storage medium and an electronic device, which at least solve the problem that two parties cannot hold a normal voice call because of factors such as a noisy environment.
According to one aspect of the embodiments of the invention, a text display method is provided, including: determining a first text input by a first object through a first device, and performing speech synthesis on the first text to obtain first audio; sending the first audio to a second device and acquiring second audio collected by the second device, where the second audio is the audio uttered by a second object after receiving the first audio through the second device; and performing speech recognition on the second audio, and displaying, through the first device, a second text obtained after the speech recognition.
Further, determining, through the first device, the first text input by the first object includes: collecting audio of the first object through the first device, and performing speech recognition on the collected audio to obtain a recognized text; displaying the recognized text on a display page of the first device; and determining the first text obtained after the first object edits the recognized text through the display page.
Further, before the audio of the first object is collected through the first device, the method further includes: capturing an image of a preset area through an image capture apparatus of the first device to obtain a target image, and determining the first object according to a first target feature and the target image, where the first target feature includes at least one of: mouth movement, distance from the first device, and positional relationship with the first device; and/or collecting audio through an audio capture apparatus of the first device to obtain a target audio, and determining the first object according to a second target feature and the target audio, where the second target feature includes at least a voiceprint.
Further, collecting audio of the first object through the first device includes: determining a target positional relationship between the first object and the first device; and, when the target positional relationship indicates that the first object is located in a target direction of the first device, controlling an audio capture apparatus of the first device to collect audio in that target direction.
Further, performing speech recognition on the collected audio includes: when a speech feature of the first object exists in a speech library, enhancing, according to the speech feature, the sound in the collected audio that corresponds to the first object and attenuating the sound that does not, to obtain processed audio, and performing speech recognition on the processed audio; and, when a target mouth movement of the first object is captured, performing speech recognition on the collected audio in combination with the target mouth movement, where the target mouth movement assists the speech recognition.
Further, after the first text input by the first object is determined through the first device, the method further includes: when an image of the first object and/or audio of the first object is captured by the first device, sending at least two of the first text, the image of the first object and the audio of the first object to the second device.
Further, displaying, through the first device, the second text obtained after the speech recognition includes: when the image of the second object sent by the second device is acquired, displaying the image of the second object through the first device; and, while the image of the second object is displayed, playing the second audio and displaying the second text over the image of the second object.
According to another aspect of the embodiments of the invention, a text display apparatus is also provided, including: a first processing module, configured to determine a first text input by a first object through a first device, and perform speech synthesis on the first text to obtain first audio; a second processing module, configured to send the first audio to a second device and acquire second audio collected by the second device, where the second audio is the audio uttered by a second object after receiving the first audio through the second device; and a third processing module, configured to perform speech recognition on the second audio and display, through the first device, a second text obtained after the speech recognition.
As an optional example, the first processing module is further configured to collect audio of the first object through the first device, and perform speech recognition on the collected audio to obtain a recognized text; display the recognized text on a display page of the first device; and determine the first text obtained after the first object edits the recognized text through the display page.
As an optional example, the first processing module is further configured to, before the audio of the first object is collected through the first device, capture an image of a preset area through an image capture apparatus of the first device to obtain a target image, and determine the first object according to a first target feature and the target image, where the first target feature includes at least one of: mouth movement, distance from the first device, and positional relationship with the first device; and/or collect audio through an audio capture apparatus of the first device to obtain a target audio, and determine the first object according to a second target feature and the target audio, where the second target feature includes at least a voiceprint.
As an optional example, the first processing module is further configured to determine a target positional relationship between the first object and the first device, and, when the target positional relationship indicates that the first object is located in a target direction of the first device, control an audio capture apparatus of the first device to collect audio in that target direction.
As an optional example, the first processing module is further configured to, when a speech feature of the first object exists in a speech library, enhance, according to the speech feature, the sound in the collected audio that corresponds to the first object and attenuate the sound that does not, to obtain processed audio, and perform speech recognition on the processed audio; and, when a target mouth movement of the first object is captured, perform speech recognition on the collected audio in combination with the target mouth movement, where the target mouth movement assists the speech recognition.
As an optional example, the second processing module is further configured to, after the first text input by the first object is determined through the first device, send at least two of the first text, the image of the first object and the audio of the first object to the second device when the image of the first object and the audio of the first object are captured by the first device.
As an optional example, the third processing module is further configured to, when the image of the second object sent by the second device is acquired, display the image of the second object through the first device; and, while the image of the second object is displayed, play the second audio and display the second text over the image of the second object.
According to yet another aspect of the embodiments of the invention, a computer-readable storage medium is also provided, in which a computer program is stored, where the computer program is configured to execute the above text display method when run.
According to yet another aspect of the embodiments of the invention, an electronic device is also provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the above text display method through the computer program.
According to the invention, when the first object is in a noisy environment, the content it wants to express is delivered to the second object through speech synthesis, so the second object can hear that content clearly; meanwhile, the second object's speech is recognized and the result is displayed on the first device, so the first object can also clearly follow what the second object expresses. This solves the problem that two parties cannot hold a normal voice call because of factors such as a noisy environment: even when the user's surroundings are noisy or unsuitable for real-time speech, the user can still reach family or friends by voice.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; those skilled in the art can derive other drawings from these drawings without inventive effort.
FIG. 1 is a diagram illustrating a hardware environment of a text display method according to an embodiment of the present application;
FIG. 2 is a flow diagram of a text display method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a display interface of a first device according to an embodiment of the invention;
FIG. 4 is an overall framework diagram of a text display method according to an embodiment of the present invention;
FIG. 5 is a structural block diagram of a text display apparatus according to an embodiment of the present invention.
Detailed Description
To help those skilled in the art better understand the technical solutions, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. The described embodiments are obviously only some, not all, embodiments of the present application; all other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present application.
It should be noted that the terms "first", "second" and the like in the description, claims and drawings of this application are used to distinguish similar elements and do not necessarily describe a particular sequence or chronological order. Data so labeled are interchangeable where appropriate, so that the embodiments described herein can be practiced in orders other than those illustrated or described. Moreover, the terms "comprise", "include" and "have", and any variants thereof, cover a non-exclusive inclusion: a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed or inherent to it.
According to one aspect of the embodiments of the present application, a text display method is provided. The method is widely applicable to whole-house intelligent digital control scenarios such as the smart home (Smart Home), smart home-device ecosystems and intelligent house (Intelligent House) ecosystems. Optionally, in this embodiment, the text display method may be applied to the hardware environment formed by the terminal device 102 and the server 104 shown in FIG. 1. As shown in FIG. 1, the server 104 is connected to the terminal device 102 through a network and may provide services (e.g. application services) for the terminal or for a client installed on the terminal; a database may be configured on the server or independently of it to provide data storage services for the server 104, and cloud computing and/or edge computing services may be configured on the server or independently of it to provide data computation services for the server 104.
The network may include, but is not limited to, at least one of a wired network and a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network. The wireless network may include, but is not limited to, at least one of: WIFI (Wireless Fidelity), Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, a smart air conditioner, a smart range hood, a smart refrigerator, a smart oven, a smart cooktop, a smart washing machine, a smart water heater, smart laundry equipment, a smart dishwasher, a smart projection device, a smart TV, a smart clothes-drying rack, smart curtains, smart audio-video equipment, a smart socket, a smart sound system, a smart speaker, smart fresh-air equipment, smart kitchen-and-bathroom equipment, smart bathroom equipment, a sweeping robot, a window-cleaning robot, a mopping robot, a smart air purifier, a smart steam oven, a smart microwave oven, a smart kitchen water heater, a smart purifier, a smart water dispenser, a smart door lock, and the like.
To solve the above problem, this embodiment provides a text display method, applicable to, but not limited to, a server. FIG. 2 is a flowchart of a text display method according to an embodiment of the invention, which includes the following steps:
Step S202: determining a first text input by a first object through a first device, and performing speech synthesis on the first text to obtain first audio;
As an optional example, the first device includes, but is not limited to, a mobile terminal or a smart home device, and the first object includes, but is not limited to, a user using the first device.
As an optional example, the first object may directly input the first text on the display interface of the first device; the first device then sends the first text to the server, which performs speech synthesis on it.
Optionally, if the first device has a speech synthesis engine, the first device may itself perform speech synthesis on the first text and send the synthesized first audio to the server. Optionally, the server includes, but is not limited to, a cloud server.
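As an illustration only, and not the claimed implementation, the following Python sketch models this choice of where synthesis happens; synthesize(), prepare_upload() and the payload shapes are hypothetical stand-ins for a real TTS engine and transport:

```python
# Minimal sketch of the synthesis-location choice: a device with a local TTS
# engine uploads audio, otherwise it uploads text for server-side synthesis.

def synthesize(text: str) -> bytes:
    """Placeholder TTS: a real engine would return encoded speech (e.g. WAV)."""
    return ("<speech>" + text).encode("utf-8")

def prepare_upload(first_text: str, has_local_tts: bool) -> tuple[str, bytes]:
    """Return the (kind, payload) pair the first device uploads to the server."""
    if has_local_tts:
        # The first device synthesizes locally and uploads the first audio.
        return "audio", synthesize(first_text)
    # Otherwise the raw text is uploaded and the cloud server synthesizes it.
    return "text", first_text.encode("utf-8")

kind, payload = prepare_upload("I'll call you back soon", has_local_tts=False)
print(kind)  # -> text
```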
As an optional example, determining, through the first device, the first text input by the first object may be accomplished through the following steps S11-S13:
S11: collecting audio of the first object through the first device, and performing speech recognition on the collected audio to obtain a recognized text;
S12: displaying the recognized text on a display page of the first device;
S13: determining the first text obtained after the first object edits the recognized text through the display page.
It should be noted that when the first object has much content to express, pure text input is slow, so the first object can instead speak the content. The first device collects the first object's audio and sends it to the server, which performs speech recognition on it. Because of factors such as a noisy environment, the recognition accuracy may be low; the recognized text is therefore displayed on the display page of the first device, so that the first object can correct it and obtain the first text it really wants to express.
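A minimal sketch of steps S11-S13 follows, assuming a generic recognize() backend; the user's correction in S13 is modeled as an edit callback, and all names here are illustrative rather than part of the patent:

```python
# Sketch of S11-S13: recognize the collected audio, show the draft, let the
# user correct it. recognize() is a placeholder, not a real ASR engine.
from typing import Callable

def recognize(audio: bytes) -> str:
    """Placeholder ASR: a real recognizer would decode the audio."""
    return "draft transcript"

def determine_first_text(audio: bytes, edit: Callable[[str], str]) -> str:
    recognized_text = recognize(audio)                # S11: speech recognition
    print(f"display page shows: {recognized_text}")   # S12: show the draft
    return edit(recognized_text)                      # S13: user corrects it

first_text = determine_first_text(b"...", lambda t: t.replace("draft", "final"))
print(first_text)  # -> final transcript
```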
As an optional example, before the audio of the first object is collected through the first device, the object whose audio is to be collected, i.e. the first object, must be determined. This can be done in the following way one and/or way two:
Way one: capturing an image of a preset area through an image capture apparatus of the first device to obtain a target image, and determining the first object according to a first target feature and the target image, where the first target feature includes at least one of: mouth movement, distance from the first device, and positional relationship with the first device.
Optionally, the first device may capture an image of the preset area through the image capture apparatus and send the target image to the server. The server may determine candidate speakers among the several persons in the target image according to their mouth movements, and then determine the first object actually speaking according to parameters such as the distance and the positional relationship (for example, in which direction of the first device) between the first device and each candidate speaker.
Optionally, if the server stores a target face feature of the first object, then among the face features of the several objects in the target image, the object whose face feature has a similarity to the target face feature greater than a preset threshold may be determined as the first object. Subsequent audio collection then focuses on the direction of the first object.
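To make the threshold match concrete, here is a hedged sketch using cosine similarity between a stored embedding and each detected face; the toy vectors and the 0.8 threshold stand in for whatever a real face-recognition model and preset value would supply:

```python
import math

# Compare a stored target face embedding with each detected face and keep the
# best match only if its similarity exceeds the preset threshold.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_first_object(stored: list[float],
                       detected: dict[str, list[float]],
                       threshold: float = 0.8) -> str | None:
    best = max(detected, key=lambda name: cosine(stored, detected[name]))
    return best if cosine(stored, detected[best]) > threshold else None

faces = {"left": [0.9, 0.1, 0.2], "center": [0.1, 0.95, 0.3]}
print(match_first_object([0.12, 0.9, 0.31], faces))  # -> center
```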
Way two: collecting audio through an audio capture apparatus of the first device to obtain a target audio, and determining the first object according to a second target feature and the target audio, where the second target feature includes at least a voiceprint.
Optionally, because the environment of the first object is noisy, the audio collected by the audio capture apparatus (e.g. a microphone) of the first device may contain noise or the audio of other objects. If the server stores a voiceprint of the first object, the voiceprint can be used to pick the first object's audio out of several audio streams, and thereby determine, in real time, the first object that is speaking. Subsequent audio collection then focuses on the sound of the first object.
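A sketch of this voiceprint route, under the assumption that each separated audio stream already yields a fixed-length voiceprint embedding; the stream nearest the stored voiceprint (here by Euclidean distance, an illustrative choice) is taken as the first object's audio:

```python
# Pick the audio stream whose voiceprint embedding is closest to the stored
# voiceprint of the first object. Embeddings here are toy 2-D vectors.

def euclidean(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pick_first_object_audio(stored_voiceprint: list[float],
                            streams: dict[str, list[float]]) -> str:
    return min(streams, key=lambda s: euclidean(stored_voiceprint, streams[s]))

streams = {"speaker_a": [0.2, 0.7], "speaker_b": [0.9, 0.1]}
print(pick_first_object_audio([0.25, 0.68], streams))  # -> speaker_a
```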
As an optional example, after the audio of the first object is determined, the positional relationship between the first object and the first device may be derived from that audio; the first device then captures images in the direction of the first object and confirms the first object using way one. Subsequent audio collection focuses both on the direction of the first object and on the first object's sound.
As an optional example, collecting audio of the first object through the first device may be implemented as follows: determining a target positional relationship between the first object and the first device; and, when the target positional relationship indicates that the first object is located in a target direction of the first device, controlling an audio capture apparatus of the first device to collect audio in that target direction.
It should be noted that, for better audio collection, sound may be picked up directionally according to the target positional relationship between the first object and the first device. Optionally, during the audio collection, sounds not belonging to the first object (e.g. environmental noise) may be attenuated.
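The following toy sketch illustrates the idea of directional pickup: sound arriving inside a beam around the target direction keeps full gain, everything else is attenuated. A real device would use a microphone-array beamformer; the beam width and attenuation floor below are illustrative assumptions, not values from the patent:

```python
# Keep full gain inside a beam centered on the target direction; attenuate
# sources outside it (e.g. background noise from other directions).

def directional_gain(source_angle_deg: float, target_angle_deg: float,
                     beam_width_deg: float = 30.0, floor: float = 0.1) -> float:
    # Signed angular offset wrapped into [-180, 180), then taken absolute.
    offset = abs((source_angle_deg - target_angle_deg + 180.0) % 360.0 - 180.0)
    return 1.0 if offset <= beam_width_deg / 2.0 else floor

print(directional_gain(95.0, 90.0))   # inside the beam  -> 1.0
print(directional_gain(200.0, 90.0))  # outside the beam -> 0.1
```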
As an optional example, performing speech recognition on the collected audio may further be implemented as follows: when a speech feature of the first object exists in a speech library, enhancing, according to the speech feature, the sound in the collected audio that corresponds to the first object and attenuating the sound that does not, to obtain processed audio, and performing speech recognition on the processed audio; and, when a target mouth movement of the first object is captured, performing speech recognition on the collected audio in combination with the target mouth movement, where the target mouth movement assists the speech recognition.
To improve the accuracy of speech recognition, the first object's sound in the collected audio may be enhanced, other sounds may be attenuated, and the target mouth movement of the first object may be used to assist the recognition.
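A minimal sketch of the enhance/attenuate step: given a mask marking which frames belong to the first object, scale those frames up and the rest down. Deriving the mask from the stored speech features (and from the captured mouth movement) is assumed to happen upstream; the gain factors are illustrative:

```python
# Boost frames attributed to the first object, attenuate everything else.

def enhance_target(frames: list[float], is_target: list[bool],
                   boost: float = 1.5, cut: float = 0.3) -> list[float]:
    return [f * (boost if t else cut) for f, t in zip(frames, is_target)]

print(enhance_target([0.4, 0.4, 0.4], [True, False, True]))
# -> roughly [0.6, 0.12, 0.6]: target frames louder, the rest quieter
```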
Step S204: sending the first audio to a second device, and acquiring second audio collected by the second device, where the second audio is the audio uttered by a second object after receiving the first audio through the second device;
As an optional example, the second device includes, but is not limited to, a smart device in the home, such as a smart refrigerator, a smart air conditioner or a smart camera, with both voice-capture and image-capture functions. The second object includes, but is not limited to, a user using the second device.
As an optional example, a third text sent by the second object through the second device may also be acquired directly and displayed through the first device. Optionally, the way the second object sends the third text through the second device is similar to the way the first object sends the first text through the first device, and is not described again in this embodiment.
As an optional example, after the first text input by the first object is determined through the first device, if an image of the first object and audio of the first object are captured by the first device, at least two of the first text, the image of the first object and the audio of the first object may be sent to the second device.
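As a small illustration of this multi-part send, the sketch below bundles the first text with whichever of the image and audio the first device captured; the field names are illustrative assumptions, not a protocol defined by the patent:

```python
# Bundle the first text plus any captured image/audio for the second device,
# so that at least two of the three items can be sent together.

def build_payload(first_text: str,
                  image: bytes | None = None,
                  audio: bytes | None = None) -> dict[str, bytes | str]:
    payload: dict[str, bytes | str] = {"text": first_text}
    if image is not None:
        payload["image"] = image
    if audio is not None:
        payload["audio"] = audio
    return payload

print(sorted(build_payload("hello", image=b"jpeg-bytes")))  # -> ['image', 'text']
```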
Step S206: performing speech recognition on the second audio, and displaying, through the first device, a second text obtained after the speech recognition.
As an optional example, displaying, through the first device, the second text obtained after the speech recognition may be implemented as follows: when the image of the second object sent by the second device is acquired, displaying the image of the second object through the first device; and, while the image of the second object is displayed, playing the second audio and displaying the second text over the image of the second object.
FIG. 3 is a schematic diagram of a display interface of the first device according to an embodiment of the present invention. As shown in FIG. 3, the second text may be displayed over the image as a pop-up window, so that the first object can both see the image and hear the sound of the second object, with caption text matching the sound, which improves the user experience.
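A GUI-free sketch of this display step, in which the image, the second audio and the caption are carried together and the render call is a placeholder; the class and field names are illustrative only:

```python
# While the second object's image is shown and the second audio plays, the
# second text is attached as a pop-up caption over the image.
from dataclasses import dataclass

@dataclass
class DisplayFrame:
    image: bytes      # image of the second object
    audio: bytes      # second audio, played alongside
    caption: str      # second text, drawn over the image

def show(frame: DisplayFrame) -> None:
    # A real client would draw the image, start playback, and overlay a popup.
    print(f"playing audio ({len(frame.audio)} bytes), caption: {frame.caption}")

show(DisplayFrame(image=b"...", audio=b"...", caption="On my way home"))
```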
Through steps S202-S206, when the first object is in a noisy environment, the content it wants to express is delivered to the second object through speech synthesis, so the second object can hear that content clearly; meanwhile, the second object's speech is recognized and displayed on the first device, so the first object can also clearly follow what the second object expresses. This solves the problem that two parties cannot hold a normal voice call because of factors such as a noisy environment: even when the user's surroundings are noisy or unsuitable for real-time speech, the user can still reach family or friends by voice.
It is to be understood that the embodiments described above are only some, not all, embodiments of the present invention. To aid understanding of the text display method, the above process is described below with reference to a specific embodiment, which does not limit the technical solution of the embodiments of the present invention. Specifically:
FIG. 4 is an overall framework diagram of a text display method according to an embodiment of the present invention. Taking the first device as a mobile terminal and the second device as a camera as an example, as shown in FIG. 4:
The user can type text into the text input box on the APP detail page corresponding to the camera on the mobile terminal, then tap Send.
Optionally, the mobile terminal may further determine candidate speakers based on lip movement, and determine the actual speaker according to parameters such as the distance and angle between the mobile terminal and each candidate speaker. Once the actual speaker is found, the mobile terminal is controlled to pick up the actual speaker's sound directionally while attenuating background noise. The picked-up speech is recognized and converted into text, which is displayed in the text input box; the actual speaker's corrections to that text are accepted, and the corrected text is sent to the server.
The server receives the text submitted from the APP detail page and converts it into voice information; the server sends the converted voice information to the camera; the camera plays the voice information to the user at the camera end; the camera collects the voice information replied by the user at the camera end and transmits it to the server; and the server receives the camera-end user's voice information and converts it into text information.
Optionally, when network conditions are good, the camera may also send the voice information and the video to the server synchronously.
The server sends the converted text information to the APP detail page, where it is displayed to the user, completing real-time communication between the terminal user and the camera-end user.
Optionally, when the mobile terminal acquires the voice and the video, it may play them through the APP detail page and display the text information as a bullet-screen overlay.
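The end-to-end flow of this embodiment can be sketched as follows, with the server acting as a relay that converts text to speech toward the camera and speech to text toward the APP. All parts run in-process as stand-ins for the real devices, and text_to_speech()/speech_to_text() are placeholders rather than real engines:

```python
# Server-side relay of the FIG. 4 embodiment: APP text -> camera audio,
# camera audio -> APP text (shown there as a bullet-screen caption).

def text_to_speech(text: str) -> bytes:
    return ("<speech>" + text).encode()              # placeholder TTS

def speech_to_text(audio: bytes) -> str:
    return audio.decode().removeprefix("<speech>")   # placeholder ASR

def relay_app_to_camera(app_text: str) -> bytes:
    return text_to_speech(app_text)       # camera plays this audio

def relay_camera_to_app(camera_audio: bytes) -> str:
    return speech_to_text(camera_audio)   # APP displays this text

audio = relay_app_to_camera("Dinner is ready?")
print(relay_camera_to_app(audio))  # -> Dinner is ready?
```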
From the above description of the embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though the former is often the better implementation. Based on this understanding, the technical solutions of the present invention may be embodied as a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and including instructions for enabling a terminal device (such as a mobile phone, a computer, a server or a network device) to execute the methods of the embodiments of the present invention.
This embodiment further provides a text display apparatus for implementing the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
FIG. 5 is a structural block diagram of a text display apparatus according to an embodiment of the present invention. The apparatus includes:
the first processing module 52 is configured to determine a first text input by a first object through a first device, and perform speech synthesis on the first text to obtain a first audio;
the second processing module 54 is configured to send the first audio to a second device, and obtain a second audio collected by the second device, where the second audio is an audio sent by a second object after receiving the first audio through the second device;
and a third processing module 56, configured to perform speech recognition on the second audio, and display, by the first device, a second text obtained after the speech recognition.
Through the device, under the condition that the environment where the first object is located is noisy, the content which the first object wants to express is transmitted to the second object in a speech synthesis mode, the second object can clearly hear the expressed content of the first object, meanwhile, speech recognition is carried out on the speech of the second object, the speech is displayed on the first device, the first object can also clearly know the content expressed by the second object, the problem that normal speech communication cannot be carried out between two communication parties due to factors such as noisy environment is solved, and under the condition that the environment for a user is noisy or the environment for the user does not meet real-time communication, the user can still be contacted with family or friends through the speech.
As an optional example, the first processing module 52 is further configured to collect audio of the first object through the first device, and perform speech recognition on the collected audio to obtain a recognized text; display the recognized text on a display page of the first device; and determine the first text obtained after the first object edits the recognized text through the display page.
As an optional example, the first processing module 52 is further configured to, before the audio of the first object is collected through the first device, capture an image of a preset area through an image capture apparatus of the first device to obtain a target image, and determine the first object according to a first target feature and the target image, where the first target feature includes at least one of: mouth movement, distance from the first device, and positional relationship with the first device; and/or collect audio through an audio capture apparatus of the first device to obtain a target audio, and determine the first object according to a second target feature and the target audio, where the second target feature includes at least a voiceprint.
As an optional example, the first processing module 52 is further configured to determine a target positional relationship between the first object and the first device, and, when the target positional relationship indicates that the first object is located in a target direction of the first device, control an audio capture apparatus of the first device to collect audio in that target direction.
As an optional example, the first processing module 52 is further configured to, when a speech feature of the first object exists in a speech library, enhance, according to the speech feature, the sound in the collected audio that corresponds to the first object and attenuate the sound that does not, to obtain processed audio, and perform speech recognition on the processed audio; and, when a target mouth movement of the first object is captured, perform speech recognition on the collected audio in combination with the target mouth movement, where the target mouth movement assists the speech recognition.
As an optional example, the second processing module 54 is further configured to, after the first text input by the first object is determined through the first device, send at least two of the first text, the image of the first object and the audio of the first object to the second device when the image of the first object and the audio of the first object are captured by the first device.
As an optional example, the third processing module 56 is further configured to, when the image of the second object sent by the second device is acquired, display the image of the second object through the first device; and, while the image of the second object is displayed, play the second audio and display the second text over the image of the second object.
An embodiment of the present invention further provides a computer-readable storage medium in which a computer program is stored, where the computer program is configured to execute the steps of any of the above method embodiments when run.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1: determining a first text input by a first object through a first device, and performing speech synthesis on the first text to obtain first audio;
S2: sending the first audio to a second device, and acquiring second audio collected by the second device, where the second audio is the audio uttered by a second object after receiving the first audio through the second device;
S3: performing speech recognition on the second audio, and displaying, through the first device, a second text obtained after the speech recognition.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations, and details of this embodiment are not repeated herein.
Embodiments of the present invention further provide an electronic device including a memory in which a computer program is stored, and a processor configured to run the computer program to perform the steps of any of the above method embodiments.
Optionally, in this embodiment, the processor may be configured to execute the following steps through the computer program:
S1: determining a first text input by a first object through a first device, and performing speech synthesis on the first text to obtain first audio;
S2: sending the first audio to a second device, and acquiring second audio collected by the second device, where the second audio is the audio uttered by a second object after receiving the first audio through the second device;
S3: performing speech recognition on the second audio, and displaying, through the first device, a second text obtained after the speech recognition.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over a network of computing devices; they may be implemented as program code executable by computing devices, so that they can be stored in a storage device and executed by a computing device; in some cases the steps shown or described may be performed in an order different from that described here; and they may be fabricated as individual integrated-circuit modules, or several of them may be fabricated as a single integrated-circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
The above is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements shall also fall within the protection scope of the present application.

Claims (10)

1. A text display method, comprising:
determining a first text input by a first object through a first device, and performing speech synthesis on the first text to obtain first audio;
sending the first audio to a second device, and acquiring second audio collected by the second device, wherein the second audio is the audio uttered by a second object after receiving the first audio through the second device;
and performing speech recognition on the second audio, and displaying, through the first device, a second text obtained after the speech recognition.
2. The method of claim 1, wherein determining, through the first device, the first text input by the first object comprises:
collecting audio of the first object through the first device, and performing speech recognition on the collected audio to obtain a recognized text;
displaying the recognized text on a display page of the first device;
and determining the first text obtained after the first object edits the recognized text through the display page.
3. The method of claim 2, wherein, before the audio of the first object is collected through the first device, the method further comprises:
capturing an image of a preset area through an image capture apparatus of the first device to obtain a target image; and determining the first object according to a first target feature and the target image, wherein the first target feature comprises at least one of: mouth movement, distance from the first device, and positional relationship with the first device; and/or
collecting audio through an audio capture apparatus of the first device to obtain a target audio; and determining the first object according to a second target feature and the target audio, wherein the second target feature comprises at least a voiceprint.
4. The method of claim 2 or 3, wherein collecting audio of the first object through the first device comprises:
determining a target positional relationship between the first object and the first device;
and, when the target positional relationship indicates that the first object is located in a target direction of the first device, controlling an audio capture apparatus of the first device to collect audio in the target direction.
5. The method of claim 2, wherein performing speech recognition on the collected audio comprises:
when a speech feature of the first object exists in a speech library, enhancing, according to the speech feature, the sound in the collected audio that corresponds to the first object and attenuating the sound that does not correspond to the first object, to obtain processed audio; and performing speech recognition on the processed audio;
and, when a target mouth movement of the first object is captured, performing speech recognition on the collected audio in combination with the target mouth movement, wherein the target mouth movement assists the speech recognition.
6. The method of any one of claims 1 to 5, wherein, after the first text input by the first object is determined through the first device, the method further comprises:
when an image of the first object and audio of the first object are captured by the first device, sending at least two of the first text, the image of the first object and the audio of the first object to the second device.
7. The method of claim 6, wherein displaying, through the first device, the second text obtained after the speech recognition comprises:
when the image of the second object sent by the second device is acquired, displaying the image of the second object through the first device;
and, while the image of the second object is displayed, playing the second audio and displaying the second text over the image of the second object.
8. A text display apparatus, comprising:
the first processing module is used for determining a first text input by a first object through first equipment and carrying out voice synthesis on the first text to obtain a first audio;
the second processing module is used for sending the first audio to second equipment and acquiring second audio collected by the second equipment, wherein the second audio is audio sent by a second object after the first audio is received by the second equipment;
and the third processing module is used for carrying out voice recognition on the second audio and displaying a second text obtained after the voice recognition through the first equipment.
9. A computer-readable storage medium comprising a stored program, wherein the program, when executed, performs the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is arranged to execute the method of any one of claims 1 to 7 by means of the computer program.
CN202210589139.9A 2022-05-27 2022-05-27 Text display method and device, storage medium and electronic device Pending CN115134465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210589139.9A CN115134465A (en) 2022-05-27 2022-05-27 Text display method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210589139.9A CN115134465A (en) 2022-05-27 2022-05-27 Text display method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN115134465A 2022-09-30

Family

ID=83378419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210589139.9A Pending CN115134465A (en) 2022-05-27 2022-05-27 Text display method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN115134465A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030157968A1 (en) * 2002-02-18 2003-08-21 Robert Boman Personalized agent for portable devices and cellular phone
CN104285428A (en) * 2012-05-08 2015-01-14 三星电子株式会社 Method and system for operating communication service
CN106448665A (en) * 2016-10-28 2017-02-22 努比亚技术有限公司 Voice processing device and method
CN108564961A (en) * 2017-11-29 2018-09-21 华北计算技术研究所(中国电子科技集团公司第十五研究所) A kind of voice de-noising method of mobile communication equipment
CN110097872A (en) * 2019-04-30 2019-08-06 维沃移动通信有限公司 A kind of audio-frequency processing method and electronic equipment
CN110570864A (en) * 2018-06-06 2019-12-13 上海擎感智能科技有限公司 Communication method and system based on cloud server and cloud server
CN110827823A (en) * 2019-11-13 2020-02-21 联想(北京)有限公司 Voice auxiliary recognition method and device, storage medium and electronic equipment
CN111045639A (en) * 2019-12-11 2020-04-21 深圳追一科技有限公司 Voice input method, device, electronic equipment and storage medium
CN111583937A (en) * 2020-04-30 2020-08-25 珠海格力电器股份有限公司 Voice control awakening method, storage medium, processor, voice equipment and intelligent household appliance
CN112911062A (en) * 2019-12-02 2021-06-04 中兴通讯股份有限公司 Voice processing method, control device, terminal device and storage medium
CN114513622A (en) * 2022-02-14 2022-05-17 阿里巴巴(中国)有限公司 Speaker detection method, speaker detection apparatus, storage medium, and program product

Similar Documents

Publication Publication Date Title
CN107515925A (en) Method for playing music and device
CN107682752B (en) Method, device and system for displaying video picture, terminal equipment and storage medium
CN110087131A (en) TV control method and main control terminal in television system
CN110677542B (en) Call control method and related product
JP6713057B2 (en) Mobile body control device and mobile body control program
CN111967770A (en) Questionnaire data processing method and device based on big data and storage medium
CN110716444A (en) Sound control method and device based on smart home and storage medium
CN109448710A (en) Method of speech processing and device, household appliance, storage media electronic device
CN110730330B (en) Sound processing method and device, doorbell and computer readable storage medium
CN109343481B (en) Method and device for controlling device
CN104378596B (en) A kind of method and device carrying out distance communicating with picture pick-up device
CN114898746A (en) Interaction method and device, storage medium and electronic device
CN113284500A (en) Audio processing method and device, electronic equipment and storage medium
CN105430449B (en) Media file playing method, apparatus and system
CN106210247A (en) Terminal control method and device
CN111882711A (en) Door lock control method and system, storage medium and electronic device
CN115134465A (en) Text display method and device, storage medium and electronic device
CN110708600A (en) Method and apparatus for identifying valid viewers of a television
EP3259906B1 (en) Handling nuisance in teleconference system
CN108766486B (en) Control method and device and electronic equipment
CN113542466A (en) Audio processing method, electronic device and storage medium
CN111048081B (en) Control method, control device, electronic equipment and control system
CN112992137B (en) Voice interaction method and device, storage medium and electronic device
CN113316041B (en) Remote health detection system, method, device and equipment
WO2022143300A1 (en) Visitor talkback control method, talkback control apparatus, system, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination