CN113889114A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN113889114A
Authority
CN
China
Prior art keywords
virtual information
display
voice data
user
display device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010627737.1A
Other languages
Chinese (zh)
Inventor
武卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010627737.1A priority Critical patent/CN113889114A/en
Publication of CN113889114A publication Critical patent/CN113889114A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048: Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481: Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04812: Interaction techniques based on cursor appearance or behaviour, e.g. being affected by the presence of displayed objects

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the disclosure disclose a data processing method and apparatus, an electronic device, and a storage medium, wherein the method comprises the following steps: acquiring voice data in an acquisition area; processing the voice data to obtain virtual information, wherein the virtual information comprises text content corresponding to the voice data; and outputting the virtual information to at least one display device in a user area for display on the display device, so that a user viewing the acquisition area through the display device sees the virtual information within the user's line of sight. By fusing virtual display technology with speech recognition technology, the technical solution addresses the technical problem that, in certain special scenes, hearing-impaired people and other viewers cannot accurately receive voice information when watching live content such as performances.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
In some performance scenarios, hearing-impaired audience members are accommodated by showing the actors' lines on a display screen. However, this scheme requires manual operation for every line, which is inefficient; some audience members cannot see the on-screen subtitles clearly; and only the lines themselves can be presented, with no way to offer richer information on top of them.
Disclosure of Invention
The embodiment of the disclosure provides a data processing method and device, electronic equipment and a computer-readable storage medium.
In a first aspect, an embodiment of the present disclosure provides a data processing method, including:
acquiring voice data in an acquisition area;
processing the voice data to obtain virtual information, wherein the virtual information comprises text content corresponding to the voice data;
and outputting the virtual information to at least one display device in a user area for display on the display device, so that a user viewing the acquisition area through the display device sees the virtual information within the user's line of sight.
Further, processing the voice data to obtain virtual information includes:
preprocessing the voice data;
recognizing the preprocessed voice data by using an acoustic model to obtain corresponding candidate content;
and performing semantic processing on the candidate content by using a semantic model to obtain the text content.
Further, after processing the voice data to obtain the virtual information, the method further includes:
translating the text content into target content corresponding to a target language associated with the display device.
Further, the display device comprises an AR display device.
In a second aspect, an embodiment of the present disclosure provides a data processing method, which is performed on a display device, and includes:
acquiring virtual information; the virtual information comprises text content obtained by recognizing voice data acquired in the acquisition area;
and displaying the virtual information, so that a user viewing the acquisition area through the display device sees the virtual information within the user's line of sight.
Further, displaying the virtual information includes:
acquiring image data in the acquisition area;
acquiring a display image by superimposing the virtual information on the image data;
and displaying the display image.
Further, the display device includes a transparent display unit, and displaying the virtual information includes:
displaying the text content on the transparent display unit, so that the text content is displayed superimposed on the information the user views through the transparent display unit.
Further, the method further comprises:
receiving a target language configured by a user;
before displaying the virtual information, the method further comprises the following steps:
when the text content in the virtual information does not match the target language, translating the text content in the virtual information into target content corresponding to the target language.
In a third aspect, an embodiment of the present disclosure provides a data processing method, where the method is performed on a display device, where the display device includes a voice acquisition unit and a display unit, and includes:
acquiring voice data acquired by the voice acquisition unit;
processing the voice data to obtain virtual information, wherein the virtual information comprises text contents corresponding to the voice data;
outputting the virtual information to the display unit to display the virtual information on the display unit.
Further, processing the voice data to obtain virtual information includes:
preprocessing the voice data;
recognizing the preprocessed voice data by using an acoustic model to obtain corresponding candidate content;
and performing semantic processing on the candidate content by using a semantic model to obtain the text content.
Further, the display device includes an image acquisition unit, and outputting the virtual information to the display unit to display the virtual information on the display unit includes:
acquiring image data acquired by the image acquisition unit;
acquiring a display image by superimposing the virtual information on the image data;
and outputting the display image to the display unit for displaying.
Further, the display unit includes a transparent display unit, and outputting the virtual information to the display unit to display the virtual information on the display unit includes:
outputting the text content to the transparent display unit for display, so that the text content is displayed superimposed on the information the user views through the transparent display unit.
Further, the method further comprises:
receiving a target language configured by a user;
before outputting the virtual information to the display unit, the method further includes:
when the text content in the virtual information does not match the target language, translating the text content in the virtual information into target content corresponding to the target language.
In a fourth aspect, an embodiment of the present disclosure provides a data processing apparatus, including:
the first acquisition module is configured to acquire voice data in an acquisition area;
the first processing module is configured to process the voice data to obtain virtual information, and the virtual information comprises text content corresponding to the voice data;
the first output module is configured to output the virtual information to at least one display device in a user area for display on the display device, so that a user viewing the acquisition area through the display device sees the virtual information within the user's line of sight.
In a fifth aspect, an embodiment of the present disclosure provides a data processing apparatus, where the apparatus is located on a display device, and the apparatus includes:
a second acquisition module configured to acquire virtual information, where the virtual information includes text content obtained by recognizing voice data acquired in the acquisition area;
the display module is configured to display the virtual information, so that a user viewing the acquisition area through the display device sees the virtual information within the user's line of sight.
In a sixth aspect, an embodiment of the present disclosure provides a data processing apparatus, where the apparatus is located in a display device, the display device includes a voice acquisition unit and a display unit, and the apparatus includes:
the third acquisition module is configured to acquire the voice data acquired by the voice acquisition unit;
the second processing module is configured to process the voice data to obtain virtual information, and the virtual information comprises text contents corresponding to the voice data;
a second output module configured to output the virtual information to the display unit to display the virtual information on the display unit.
The above functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the apparatus includes a memory configured to store one or more computer instructions that enable the apparatus to perform the corresponding method, and a processor configured to execute the computer instructions stored in the memory. The apparatus may also include a communication interface for the apparatus to communicate with other devices or a communication network.
In a seventh aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of any of the above aspects.
In an eighth aspect, the present disclosure provides a computer-readable storage medium for storing computer instructions for use by any one of the above apparatuses, which includes computer instructions for performing the method according to any one of the above aspects.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
through the embodiments of the disclosure, the speech of a target object in the acquisition area can be converted into text content in real time, and virtual information including the text content is sent to the display device and presented to the user, so that the user can read the text content corresponding to the voice data in the acquisition area while watching the acquisition area through the display device. The embodiments combine virtual display technology with speech recognition technology, thereby addressing the technical problem that, in certain special scenes, hearing-impaired people and other viewers cannot accurately receive voice information when watching live content such as performances.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 shows a flow diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a data processing method according to another embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a data processing method according to yet another embodiment of the present disclosure;
fig. 4 shows a schematic flow chart of an application in a stage performance scene according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device suitable for implementing a data processing method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
The details of the embodiments of the present disclosure are described in detail below with reference to specific embodiments.
Fig. 1 shows a flow diagram of a data processing method according to an embodiment of the present disclosure. As shown in fig. 1, the data processing method includes the steps of:
in step S101, acquiring voice data in an acquisition area;
in step S102, processing the voice data to obtain virtual information, where the virtual information includes text content corresponding to the voice data;
in step S103, the virtual information is output to at least one display device in the user area for display on the display device, so that a user viewing the acquisition area through the display device sees the virtual information within the user's line of sight.
In this embodiment, the data processing method may be implemented on a processor, and the processor may be located on a server or another processing device separate from the display device. The acquisition area and the user area may be predetermined; for example, the acquisition area may be the area where a target object outputting speech is located, and the user area may be the area where the objects receiving the speech are located. In an artistic application scenario, for instance, the acquisition area may be the stage area where the performers are, and the user area may be the area where the audience sits. A voice acquisition device, such as a microphone, can be arranged in the acquisition area and outputs the voice data collected in real time to the processor; the processor processes the voice data in real time, converts it into corresponding text content, and sends virtual information including the text content to the display devices in the user area.
In some embodiments, the virtual information may also include display information of the text content and other related information, and the display information may include, for example, information of display position, display mode, display format, and the like. The virtual information may also include synchronization information, such as the occurrence time of the voice data corresponding to the text content.
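Purely as an illustration of how such a payload might be organized, the following Python sketch models one unit of virtual information; the field names and default values are assumptions for this sketch, not part of the disclosure.

from dataclasses import dataclass, field

@dataclass
class VirtualInfo:
    """One unit of virtual information sent to a display device (hypothetical layout)."""
    text: str                      # text content recognized from the voice data
    speech_time: float = 0.0       # synchronization info: when the speech occurred, in seconds
    position: tuple = (0.5, 0.9)   # display position in normalized screen coordinates
    mode: str = "subtitle"         # display mode, e.g. "subtitle" or "bullet_screen"
    fmt: dict = field(default_factory=lambda: {"font_size": 24, "color": "#FFFFFF"})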
In this embodiment, the user area may contain one or more display devices for use by one or more users. After receiving the virtual information, the display device may present it within the user's line of sight while the user views the acquisition area through the device. In some embodiments, the display device may be an AR display device, for example AR glasses: when the user wears the AR glasses to watch the acquisition area, the glasses can render the virtual information so that the user sees the corresponding virtual information while seeing the real picture of the acquisition area through the lenses. For example, when audience members watch a stage performance through AR glasses, they see the real performance on the stage and, at the same time, corresponding virtual information such as subtitles.
Through the embodiments of the disclosure, the speech of a target object in the acquisition area can be converted into text content in real time, and virtual information including the text content is sent to the display device and presented to the user, so that the user can read the text content corresponding to the voice data in the acquisition area while watching the acquisition area through the display device. The embodiments combine virtual display technology with speech recognition technology, thereby addressing the technical problem that, in certain special scenes, hearing-impaired people and other viewers cannot accurately receive voice information when watching live content such as performances.
In an optional implementation manner of this embodiment, step S101, namely the step of acquiring the voice data in the acquisition area, further includes the following steps:
acquiring, in real time, the voice data collected by a voice acquisition device arranged in the acquisition area.
In this optional implementation manner, a voice acquisition device may be arranged in the acquisition area to collect voice data in real time. The voice acquisition device may be, for example, a 360-degree microphone array; after collecting the voice data it may perform preprocessing such as amplification and then output the data to the server side. It is to be understood that the server may be a local computer device capable of processing voice data or a remote server device, which may be set according to actual needs and is not limited herein.
In an optional implementation manner of this embodiment, step S102, namely, the step of processing the voice data to obtain the virtual information, further includes the following steps:
preprocessing the voice data;
recognizing the preprocessed voice data by using an acoustic model to obtain corresponding candidate content;
and performing semantic processing on the candidate content by using a semantic model to obtain the text content.
In this optional implementation manner, preprocessing such as noise reduction and filtering may be performed on the voice data, and the audio content of the target object in the acquisition area is extracted from it. The target object may be any object that utters speech in the acquisition area, and may be one object or several. The audio content of the target object can be extracted from the preprocessed voice data through functions such as audio recognition; speech recognition is then performed on the audio content with an acoustic model to obtain corresponding candidate content. Finally, contextual semantic processing is performed on the candidate content with a semantic model, and text content that conforms to semantic logic is output. The semantic model may be obtained by training on lines commonly used in performance scenarios and on the script. The acoustic model and the semantic model may employ models already implemented in the related art, which are not limited herein.
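The following Python sketch shows the shape of this three-stage pipeline (preprocessing, acoustic model, semantic model). The functions are stand-ins under stated assumptions; the disclosure does not name concrete models or libraries.

def preprocess(audio: bytes) -> bytes:
    # Noise reduction and filtering would happen here; identity stand-in.
    return audio

def acoustic_model(audio: bytes) -> list:
    # Maps audio to candidate transcriptions; a real model replaces this stub.
    return ["candidate line A", "candidate line B"]

def semantic_model(candidates: list) -> str:
    # Picks/repairs the candidate that best fits the context; per the
    # disclosure it could be trained on common stage lines and the script.
    return candidates[0]

def voice_to_text(audio: bytes) -> str:
    """Full recognition pipeline: preprocessing -> acoustic -> semantic."""
    return semantic_model(acoustic_model(preprocess(audio)))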
In some embodiments, an image acquisition device may also be arranged in the acquisition area and configured to acquire image data synchronized with the voice data. The image data and its synchronization information may be sent to the server, and the server may identify whether the voice data was uttered by a target object in the image data and perform corresponding preprocessing according to the identification result. For example, in an application scene of watching a stage play, if it is recognized from the image data synchronized with the voice data that the current sound was not made by the actor but by surrounding noise or other people, the server side may filter that sound out and retain only the speech of the actor on the stage. It is understood that, in an application scene such as a stage drama, the actor's voice will stand out from the surrounding noise, so the voice data can also be filtered with a common filter to retain the actor's voice; this may be set according to actual needs and is not limited herein.
In other embodiments, the server may further identify sound directivity from the image data synchronized with the voice data, that is, which direction the sound emitted by the target object is aimed at, and then perform corresponding processing based on that directivity. When the user area corresponds to a plurality of display devices, the server may transmit the processed virtual information only to the display devices the sound is directed toward, rather than to all of them. For example, in a performance scene, an actor may want to interact with the audience below the stage and speak toward audiences in different areas at different times; the server can transmit the virtual information to the display devices of the audience in the area currently pointed to by the actor's voice.
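A minimal sketch of such directivity-based routing, assuming each display device is registered with a seating bearing relative to the stage; the sector geometry and field names are assumptions for illustration.

def devices_in_direction(devices, direction_deg, beam_width_deg=60.0):
    """Return only the display devices seated inside the sector the
    performer is currently facing (bearings measured from stage center)."""
    selected = []
    for dev in devices:
        # Smallest signed angle between device bearing and sound direction.
        diff = (dev["bearing_deg"] - direction_deg + 180.0) % 360.0 - 180.0
        if abs(diff) <= beam_width_deg / 2.0:
            selected.append(dev)
    return selected

# An actor facing 90 degrees reaches only the devices seated near 90 degrees.
audience = [{"id": 1, "bearing_deg": 85.0}, {"id": 2, "bearing_deg": 200.0}]
print(devices_in_direction(audience, direction_deg=90.0))  # -> device 1 only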
In an application scene of watching a stage performance, bullet-screen (danmaku) comments surrounding the actors can also be displayed on the display device. The user may add a comment through voice control or a setup interface on the display device, or through a user device that interacts with the display device. The display device can also upload the user's comments to the server side, and the server side can share the comment information to other display devices.
In an optional implementation manner of this embodiment, after step S102, that is, after the step of processing the voice data to obtain the virtual information, the method further includes the following steps:
translating the text content into target content corresponding to a target language associated with the display device.
In this optional implementation, the text content obtained by recognizing the voice data may be translated into target content in a target language. The target language may be a language type associated with the display device, such as Chinese or English. There may be a plurality of display devices, each associated with a different target language; after the text content corresponding to the voice data is determined, it can be automatically translated into the target content of each associated target language, and the virtual information including that target content is then output to the corresponding display device. In this way, not only can people in special scenes or with hearing impairments effectively receive the voice information output by the target object in the acquisition area, but language differences between the target object and the users of the display devices are also bridged. For artistic performances, for example, dramas in different languages can be brought to audiences all over the world in this manner, greatly lowering the threshold of cultural appreciation.
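As a sketch of this per-device translation step, results can be cached per target language so each language is translated only once; the translate() call is a placeholder, since the disclosure does not specify a translation engine.

def translate(text, target_lang):
    # Placeholder for a machine-translation call.
    return f"[{target_lang}] {text}"

def localize_for_devices(text, source_lang, devices):
    """Translate recognized text once per target language associated with
    the display devices, then fan the result out device by device."""
    cache = {source_lang: text}
    per_device = {}
    for dev in devices:
        lang = dev.get("target_lang", source_lang)
        if lang not in cache:
            cache[lang] = translate(text, lang)
        per_device[dev["id"]] = cache[lang]
    return per_device

print(localize_for_devices("你好", "zh", [{"id": 1, "target_lang": "en"}]))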
Fig. 2 shows a flow diagram of a data processing method according to another embodiment of the present disclosure. As shown in fig. 2, the data processing method includes the steps of:
in step S201, virtual information is acquired; the virtual information comprises text contents obtained by identifying voice data collected in the collection area;
in step S202, the virtual information is displayed, so that a user viewing the acquisition area through the display device sees the virtual information within the user's line of sight.
In this embodiment, the data processing method may be implemented on a display device, for example an AR display device. The acquisition area and the user area may be predetermined; for example, the acquisition area may be the area where a target object outputting speech is located, and the user area may be the area where the objects receiving the speech are located. In an artistic application scenario, for instance, the acquisition area may be the stage area where the performers are, and the user area may be the area where the audience sits. A voice acquisition device such as a microphone can be arranged in the acquisition area and outputs the voice data collected in real time to corresponding equipment; a processor on that equipment processes the voice data in real time, converts it into corresponding text content, and sends virtual information including the text content to the display devices in the user area.
In some embodiments, the display device may be an AR display device, such as AR glasses.
The user may view things in the acquisition area, such as people, objects, and scenes, through the AR display device. The AR display device can be provided with a display unit; after the virtual information is acquired, it can be displayed on the device so that the virtual information is superimposed within the user's line of sight while the user watches the acquisition area. For example, when a user in the audience area wears AR glasses to watch a stage performance, the recognized speech of the performance can be displayed, superimposed, in the stage scene the user watches through the glasses.
In some embodiments, the virtual information may also include display information for the text content and other related information, such as the display position, display mode, and display format. The display device can display the text content at a suitable position according to this display information, so that it does not block the user's view of the acquisition area. The display device can also detect the user's viewing angle, interpupillary distance, and the like, and adjust the size of the displayed text accordingly, avoiding problems such as the subtitles appearing out of focus or causing dizziness.
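One plausible way to derive a comfortable subtitle size from viewing geometry is to keep the text at a fixed visual angle at the display's virtual image distance; the sketch below illustrates the idea, and the 0.5-degree target is an assumed value, not one from the disclosure.

import math

def subtitle_height_m(virtual_image_distance_m, target_visual_angle_deg=0.5):
    """Glyph height (metres) that subtends the target visual angle at the
    distance where the display places its virtual image."""
    half_angle = math.radians(target_visual_angle_deg) / 2.0
    return 2.0 * virtual_image_distance_m * math.tan(half_angle)

# For AR glasses whose virtual image sits about 2 m away:
print(f"{subtitle_height_m(2.0) * 1000:.1f} mm glyph height")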
For other details in the embodiments of the present disclosure, reference may also be made to the description of the data processing method in fig. 1 and the related embodiments, which are not described herein again.
The embodiment of the disclosure combines virtual display technology with speech recognition technology, thereby addressing the technical problem that, in certain special scenes, hearing-impaired people and other viewers cannot accurately receive voice information when watching live content such as performances.
In an optional implementation manner of this embodiment, step S202, namely the step of displaying the virtual information, further includes the following steps:
acquiring image data in the acquisition area;
acquiring a display image by superimposing the virtual information on the image data;
and displaying the display image on the display unit.
In this optional implementation, the display device further includes an image acquisition unit, such as a monocular or binocular camera. The image data acquired in real time from the acquisition area may be a two-dimensional or three-dimensional image. After receiving the virtual information, the display device may computer-render the text content in the virtual information together with the image data to obtain a display image in which the text content is superimposed on the image data, and show that display image on the display unit. It will be appreciated that other virtual information may also be superimposed on the image data if desired.
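A minimal compositing sketch, assuming OpenCV as the rendering library (the disclosure does not name one): the text content is drawn near the bottom of the captured frame to form the display image.

import numpy as np
import cv2  # assumption: any raster drawing library would do

def render_display_image(frame, text):
    """Superimpose the recognized text content onto the captured frame."""
    out = frame.copy()
    h, w = out.shape[:2]
    cv2.putText(out, text, (int(w * 0.1), int(h * 0.9)),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
    return out

# A black test frame stands in for the image acquisition unit's output.
display_image = render_display_image(np.zeros((480, 640, 3), np.uint8), "Hello")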
In an optional implementation manner of this embodiment, the display device includes a transparent display unit, and step S202, that is, the step of displaying the virtual information, further includes the following steps:
displaying the text content on the transparent display unit, so that the text content is displayed superimposed on the information the user views through the transparent display unit.
In this alternative implementation, the display device may be an AR display device, such as AR glasses. The display unit on the device may be a transparent display unit which, when the device is worn, sits in front of the eyes so that the user can view information in the environment through it. The transparent display unit also has a display function and can present virtual information such as the text content, so that when the user watches the acquisition area through the unit, the text content is superimposed on what is being watched. Because natural reflected light from the surroundings passes through the transparent display unit normally, the user can view the surrounding environment and things through it without the wearer's sight being affected. In this way, the information the user watches in the acquisition area is real, and at the same time the text content corresponding to the voice data uttered by objects in the acquisition area is visible without any intervention.
In an optional implementation manner of this embodiment, the method further includes the following steps:
receiving a target language configured by a user;
before displaying the virtual information, the method further comprises the following steps:
when the text content in the virtual information does not match the target language, translating the text content in the virtual information into target content corresponding to the target language.
In this optional implementation manner, the user may configure a target language for the display device through the client, and when the text content in the received virtual information is not matched with the target language configured by the user, the text content in the virtual information may be translated into target content corresponding to the target language, and then displayed on the display device. In this way, the display device may be adapted to target users using any language.
In some embodiments, the user may set a language category on the display device, or on a user device that interacts with it, such as a mobile phone. That is, multiple selectable language categories may be preset on the display device; the user selects the category of a familiar language, and when the text content in the received virtual information is inconsistent with the selected category, the display device automatically translates the text content into the selected language.
In other embodiments, the user may also wear a headset, so that while the virtual information is displayed on the display device the corresponding voice data is played on the headset. The user may set a language category on the headset as well; when the received voice data is inconsistent with the selected category, the voice data can be automatically translated into the selected language. It is understood that this automatic translation may also be completed at the server side: after the user configures a language category through the headset, the headset sends it to the server, and during use the server translates the voice data into speech in the configured language and sends it to the headset.
Fig. 3 shows a flow chart of a data processing method according to yet another embodiment of the present disclosure. As shown in fig. 3, the data processing method includes the steps of:
in step S301, acquiring voice data acquired by the voice acquisition unit;
in step S302, processing the voice data to obtain virtual information, where the virtual information includes text content corresponding to the voice data;
in step S303, the virtual information is output to the display unit to display the virtual information on the display unit.
In this embodiment, the data processing method may be implemented on a display device, for example an AR display device. The display device may include a voice acquisition unit and a display unit; the voice acquisition unit may be, for example, a microphone array. The display device may be glasses, with the display unit arranged on the glasses. During use, the voice acquisition unit collects voice data from the surrounding environment in real time and outputs it to a processing unit on the display device; the processing unit processes the voice data to obtain virtual information, which may include text content corresponding to the voice data, and the virtual information is output to the display unit and displayed. The acquisition area and the user area may be predetermined; for example, the acquisition area may be the area where a target object outputting speech is located, and the user area may be the area where the objects receiving the speech are located.
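The third aspect keeps the whole path on the device itself; the loop below sketches that arrangement with stand-in components (mic, recognize, and display_unit are hypothetical interfaces, not APIs from the disclosure).

import time

def run_on_device(mic, recognize, display_unit, poll_s=0.1):
    """On-device loop: voice acquisition unit -> processing -> display unit."""
    while True:
        chunk = mic.read()                  # voice data from the environment
        if chunk:
            text = recognize(chunk)         # acoustic + semantic processing
            display_unit.show({"text": text, "ts": time.time()})
        time.sleep(poll_s)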
A user may view information in the environment, such as people, objects, and scenes, through the display unit on the display device. While the user views the environment through the display unit, the virtual information can be superimposed within the user's line of sight. For example, when a user wears the glasses to watch a stage performance, the recognized speech of the performers can be displayed, superimposed, in the stage scene the user watches through the glasses.
In some embodiments, the virtual information may also include display information for the text content and other related information, such as the display position, display mode, and display format. The display device can display the text content at a suitable position according to this display information, so that it does not block the user's view of the acquisition area. The display device can also detect the user's viewing angle, interpupillary distance, and the like, and adjust the size of the displayed text accordingly, avoiding problems such as the subtitles appearing out of focus or causing dizziness.
Through this embodiment of the disclosure, the collected speech can be converted into text content in real time on the display device itself, and virtual information including the text content is displayed to the user, so that while watching the environment the user can read the text content corresponding to the voice data uttered by objects in it. The embodiment combines virtual display technology with speech recognition technology, thereby addressing the technical problem that, in certain special scenes, hearing-impaired people and other viewers cannot accurately receive voice information when watching live content such as performances.
In an optional implementation manner of this embodiment, step S302, namely, the step of processing the voice data to obtain the virtual information, further includes the following steps:
preprocessing the voice data;
recognizing the preprocessed voice data by using an acoustic model to obtain corresponding candidate content;
and performing semantic processing on the candidate content by using a semantic model to obtain the text content.
In this optional implementation manner, preprocessing such as noise reduction and filtering may be performed on the voice data, and the audio content of the target object in the acquisition area is extracted from it. The target object may be any object that utters speech in the acquisition area, and may be one object or several. The audio content of the target object can be extracted from the preprocessed voice data through functions such as audio recognition; speech recognition is then performed on the audio content with an acoustic model to obtain corresponding candidate content. Finally, contextual semantic processing is performed on the candidate content with a semantic model, and text content that conforms to semantic logic is output. The semantic model may be obtained by training on lines commonly used in performance scenarios and on the script. The acoustic model and the semantic model may employ models already implemented in the related art, which are not limited herein.
In an optional implementation manner of this embodiment, the display device includes an image acquisition unit, and step S303, namely outputting the virtual information to the display unit to display the virtual information on the display unit, further includes the following steps:
acquiring image data acquired by the image acquisition unit in real time;
acquiring a display image by superimposing the virtual information on the image data;
and outputting the display image to the display unit for displaying.
In this optional implementation, the display device further includes an image acquisition unit, such as a monocular or binocular camera, configured to acquire image data from the environment in real time; the image data may be a two-dimensional or three-dimensional image. After the processing unit obtains the corresponding virtual information from the voice data collected by the voice acquisition unit, it can computer-render the text content in the virtual information together with the image data to obtain a display image in which the text content is superimposed on the image data, and show that display image on the display unit. It will be appreciated that other virtual information may also be superimposed on the image data if desired.
In an optional implementation manner of this embodiment, the display unit includes a transparent display unit, and step S303, that is, the step of outputting the virtual information to the display unit to display the virtual information on the display unit, further includes the following steps:
outputting the text content to the transparent display unit for display, so that the text content is displayed superimposed on the information the user views through the transparent display unit.
In this optional implementation, the display unit on the display device may be a transparent display unit which, when the device is worn, sits in front of the eyes so that the user can view information in the environment through it. The transparent display unit also has a display function and can present virtual information such as the text content, so that when the user watches the acquisition area through the unit, the text content is superimposed on what is being watched. Because natural reflected light from the surroundings passes through the transparent display unit normally, the user can view the surrounding environment and things through it without the wearer's sight being affected. In this way, the information the user watches in the acquisition area is real, and at the same time the text content corresponding to the voice data uttered by objects in the environment is visible without any intervention.
In an optional implementation manner of this embodiment, the method further includes the following steps:
receiving a target language configured by a user;
before outputting the virtual information to the display unit, the method further includes:
when the text content in the virtual information does not match the target language, translating the text content in the virtual information into target content corresponding to the target language.
In this optional implementation manner, the user may configure a target language for the display device through the client, and when the text content in the received virtual information is not matched with the target language configured by the user, the text content in the virtual information may be translated into target content corresponding to the target language, and then displayed on the display device. In this way, the display device may be adapted to target users using any language.
Fig. 4 shows a schematic flow chart of an application in a stage performance scene according to an embodiment of the present disclosure. As shown in fig. 4, a 360-degree microphone array 401 is arranged around the stage to collect, in real time, the voice data uttered by the performers on the stage. The processing device 402 may be located in the stage venue or remotely, communicating with the microphone array 401 over a communication network. In the audience area, audience members may wear AR glasses 403 to watch the performance on the stage. The AR glasses may communicate with the processing device 402 over the communication network. During the performance, the microphone array 401 sends the voice data collected in real time to the processing device 402 through the network; after processing by the processing device 402, subtitle information is obtained and sent through the network to the AR glasses 403 worn by the audience, where it is displayed in real time, so that subtitles are superimposed on the AR glasses 403 while the audience watches the live performance.
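The fig. 4 flow can be mimicked end to end with a small producer/consumer sketch; the queue stands in for the communication network between the processing device 402 and the AR glasses 403 (an assumption for illustration only).

import queue
import threading
import time

subtitles = queue.Queue()

def processing_device(audio_feed):
    # Stand-in for processing device 402: each audio chunk becomes a subtitle.
    for chunk in audio_feed:
        subtitles.put({"text": f"line recognized from {chunk!r}", "ts": time.time()})

def ar_glasses(viewer_id):
    # Stand-in for AR glasses 403: receive and display subtitles in real time.
    while True:
        info = subtitles.get()
        print(f"glasses {viewer_id} shows: {info['text']}")
        subtitles.task_done()

threading.Thread(target=ar_glasses, args=(1,), daemon=True).start()
processing_device([b"chunk-0", b"chunk-1"])  # feed from microphone array 401
subtitles.join()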
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods.
According to the data processing apparatus of an embodiment of the present disclosure, the apparatus may be implemented as part or all of an electronic device by software, hardware, or a combination of both. The data processing apparatus includes:
the first acquisition module is configured to acquire voice data in an acquisition area;
the first processing module is configured to process the voice data to obtain virtual information, and the virtual information comprises text content corresponding to the voice data;
the first output module is configured to output the virtual information to at least one display device in a user area for display on the display device, so that a user viewing the acquisition area through the display device sees the virtual information within the user's line of sight.
In an optional implementation manner of this embodiment, the first obtaining module includes:
the first acquisition sub-module is configured to acquire voice data acquired by a voice acquisition device from the voice acquisition device arranged in the acquisition area.
In an optional implementation manner of this embodiment, the first processing module includes:
a first preprocessing submodule configured to preprocess the voice data;
the first recognition submodule is configured to recognize the preprocessed voice data by using an acoustic model to obtain corresponding candidate content;
and the first semantic processing submodule is configured to perform semantic processing on the candidate content by using a semantic model to obtain the text content.
In an optional implementation manner of this embodiment, after the first processing module obtains the virtual information, the apparatus further includes:
the first translation module is configured to translate the text content into target content corresponding to a target language associated with the display device.
In an alternative implementation of this embodiment, the display device includes an AR display device.
The data processing apparatus in this embodiment corresponds to the data processing method in the embodiment and the related embodiment shown in fig. 1, and specific details can be referred to the above description of the data processing method in the embodiment and the related embodiment shown in fig. 1, and are not described herein again.
According to the data processing apparatus of another embodiment of the present disclosure, the apparatus may be implemented as part or all of an electronic device by software, hardware, or a combination of both. The apparatus is located on a display device, and the data processing apparatus includes:
a second acquisition module configured to acquire virtual information, where the virtual information includes text content obtained by recognizing voice data acquired in the acquisition area;
the display module is configured to display the virtual information, so that a user viewing the acquisition area through the display device sees the virtual information within the user's line of sight.
In an optional implementation manner of this embodiment, the display module includes:
an acquisition sub-module configured to acquire image data within the acquisition region;
a second acquisition sub-module configured to acquire a display image by superimposing the virtual information on the image data;
a first display sub-module configured to display the display image.
In an optional implementation manner of this embodiment, the display device includes a transparent display unit, and the display module includes:
a second display sub-module configured to display the text content on the transparent display unit such that the text content is displayed superimposed on information viewed by a user through the transparent display unit.
In an optional implementation manner of this embodiment, the apparatus further includes:
a first receiving module configured to receive a target language configured by a user;
before the display module displays the virtual information, the apparatus further includes:
a second translation module configured to translate the text content in the virtual information into target content corresponding to the target language when the text content does not match the target language.
The data processing apparatus in this embodiment corresponds to the data processing method in the embodiment and the related embodiment shown in fig. 2, and specific details can be referred to the above description of the data processing method in the embodiment and the related embodiment shown in fig. 2, which is not described herein again.
According to a data processing apparatus of still another embodiment of the present disclosure, the apparatus may be implemented as a part or all of an electronic device by software, hardware, or a combination of both. The device is located display device, and this display device includes pronunciation acquisition element and display element, and this data processing apparatus includes:
the third acquisition module is configured to acquire the voice data acquired by the voice acquisition unit;
the second processing module is configured to process the voice data to obtain virtual information, and the virtual information comprises text contents corresponding to the voice data;
a second output module configured to output the virtual information to the display unit to display the virtual information on the display unit.
In an optional implementation manner of this embodiment, the second processing module includes:
a second preprocessing submodule configured to preprocess the voice data;
the second recognition submodule is configured to recognize the preprocessed voice data by using an acoustic model to obtain corresponding candidate content;
and the second semantic processing submodule is configured to perform semantic processing on the candidate content by using a semantic model to obtain the text content.
In an optional implementation manner of this embodiment, the display device includes an image capturing unit, and the second output module includes:
a third acquisition sub-module configured to acquire image data acquired by the image acquisition unit;
a fourth acquisition sub-module configured to acquire a display image by superimposing the virtual information on the image data;
a first output sub-module configured to output the display image to the display unit for display.
In an optional implementation manner of this embodiment, the display unit includes a transparent display unit, and the second output module includes:
and the second output sub-module is configured to output the text content to the transparent display unit for displaying, so that the text content is displayed in a manner of being superposed on the information viewed by the user through the transparent display unit.
In an optional implementation manner of this embodiment, the apparatus further includes:
a second receiving module configured to receive a target language configured by a user;
before the second output module outputs the virtual information, the apparatus further includes:
a third translation module configured to translate the text content in the virtual information into target content corresponding to the target language when the text content does not match the target language.
The data processing apparatus in this embodiment corresponds to the data processing method in the embodiment and the related embodiment shown in fig. 3, and specific details can be referred to the above description of the data processing method in the embodiment and the related embodiment shown in fig. 3, which is not described herein again.
Fig. 5 is a schematic structural diagram of an electronic device suitable for implementing a data processing method according to an embodiment of the present disclosure.
As shown in fig. 5, the electronic device 500 includes a processing unit 501, which may be implemented as a CPU, GPU, FPGA, NPU, or similar processing unit. The processing unit 501 may perform the various processes of any of the method embodiments of the present disclosure described above according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random-access memory (RAM) 503. The RAM 503 also stores the various programs and data necessary for the operation of the electronic device 500. The processing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a cathode-ray tube (CRT) or liquid-crystal display (LCD), and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read from it can be installed into the storage section 508 as needed.
In particular, according to embodiments of the present disclosure, any of the methods described above with reference to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing any of the methods of the embodiments of the present disclosure. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509 and/or installed from the removable medium 511.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus of the above-described embodiments, or a separate computer-readable storage medium not incorporated into the device. The computer-readable storage medium stores one or more programs used by one or more processors to perform the methods described in the present disclosure.
The foregoing description covers only the preferred embodiments of the disclosure and illustrates the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the present disclosure is not limited to the specific combinations of the above features, and also encompasses other embodiments formed by any combination of the above features or their equivalents without departing from the inventive concept. For example, technical solutions formed by interchanging the above features with features disclosed in this disclosure that have similar functions also fall within this scope.

Claims (18)

1. A data processing method, comprising:
acquiring voice data in an acquisition area;
processing the voice data to obtain virtual information, wherein the virtual information comprises text content corresponding to the voice data;
and outputting the virtual information to at least one display device in a user area for display on the display device, so that a user watching the acquisition area through the display device can view the virtual information within the user's line of sight.
2. The method of claim 1, wherein processing the voice data to obtain virtual information comprises:
preprocessing the voice data;
recognizing the preprocessed voice data by using an acoustic model to obtain corresponding candidate content;
and performing semantic processing on the candidate content by using a semantic model to obtain the text content.
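By way of illustration only, the following Python sketch mirrors the two-stage pipeline of claim 2 with stub functions; preprocess, acoustic_model, and semantic_model are hypothetical names standing in for whatever recognizer an implementation adopts, and are not components disclosed in this application.

```python
# A minimal sketch of the two-stage recognition in claim 2; the models here
# are stubs, not the components actually used by the disclosed system.
def preprocess(samples):
    # Placeholder preprocessing: peak-normalize and drop near-silence.
    peak = max((abs(s) for s in samples), default=1.0) or 1.0
    return [s / peak for s in samples if abs(s) > 1e-4]

def acoustic_model(features):
    # Stub acoustic model: would map acoustic features to candidate content.
    return ["their is rain today", "there is rain today"]

def semantic_model(candidates):
    # Stub semantic model: would rescore candidates using linguistic context
    # and return the most plausible text content.
    return candidates[-1]

def voice_to_text(samples):
    candidates = acoustic_model(preprocess(samples))
    return semantic_model(candidates)

print(voice_to_text([0.0, 0.25, -0.5, 0.1]))  # -> "there is rain today"
```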
3. The method of claim 1 or 2, further comprising, after processing the voice data to obtain the virtual information:
translating the text content into target content corresponding to a target language associated with the display device.
4. The method of claim 1 or 2, wherein the display device comprises an AR display device.
5. A data processing method performed on a display device, the method comprising:
acquiring virtual information, wherein the virtual information comprises text content obtained by recognizing voice data collected in an acquisition area;
and displaying the virtual information, so that a user watching the acquisition area through the display device can see the virtual information within the user's line of sight.
6. The method of claim 5, wherein displaying the virtual information comprises:
acquiring image data in the acquisition area;
acquiring a display image by superimposing the virtual information on the image data;
and displaying the display image.
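A minimal sketch of the superimposition step in claim 6, assuming OpenCV is used for image composition; the caption position, font, and the synthetic frame are illustrative choices, not features of the claim.

```python
# Illustrative sketch: superimpose recognized text on captured image data
# to obtain a display image (claim 6). Layout values are arbitrary examples.
import numpy as np
import cv2

def compose_display_image(frame: np.ndarray, text: str) -> np.ndarray:
    display = frame.copy()
    h, _w = display.shape[:2]
    # Render the virtual information as a caption near the bottom edge.
    cv2.putText(display, text, (16, h - 24),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
    return display

# Stand-in for image data acquired in the acquisition area.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
shown = compose_display_image(frame, "recognized speech goes here")
cv2.imwrite("display_image.png", shown)
```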
7. The method of claim 5 or 6, wherein the display device comprises a transparent display unit, and displaying the virtual information comprises:
displaying the text content on the transparent display unit, so that the text content is overlaid on the scene the user views through the transparent display unit.
8. The method of claim 5 or 6, wherein the method further comprises:
receiving a target language configured by a user;
and before displaying the virtual information:
when the text content in the virtual information does not match the target language, translating the text content in the virtual information into target content corresponding to the target language.
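The conditional translation in claim 8 could be organized as in the sketch below; detect_language and translate are hypothetical stand-ins for a language-identification model and a machine-translation service, neither of which the claim specifies.

```python
# Illustrative sketch of claim 8: translate only when the recognized text
# does not already match the user-configured target language.
def detect_language(text: str) -> str:
    # Hypothetical stand-in; a real device would call a language-ID model.
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"

def translate(text: str, target_language: str) -> str:
    # Hypothetical stand-in for a machine-translation service.
    return f"[{target_language}] {text}"

def localize_virtual_information(text: str, target_language: str) -> str:
    if detect_language(text) != target_language:
        return translate(text, target_language)
    return text

print(localize_virtual_information("你好，世界", "en"))
```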
9. A data processing method performed on a display device, the display device comprising a voice acquisition unit and a display unit, the method comprising:
acquiring voice data collected by the voice acquisition unit;
processing the voice data to obtain virtual information, wherein the virtual information comprises text content corresponding to the voice data;
outputting the virtual information to the display unit to display the virtual information on the display unit.
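Reading claim 9 end to end suggests a capture-recognize-display loop like the sketch below; the sounddevice capture settings and both stub functions are assumptions made for illustration, since the claim fixes neither a sampling rate nor a recognizer.

```python
# Illustrative device-side loop (claim 9): capture a short audio window from
# the voice acquisition unit, recognize it, and hand the text to the display
# unit. Capture settings and the recognizer stub are assumptions.
import sounddevice as sd

SAMPLE_RATE = 16_000  # Hz; a common rate for speech recognition
WINDOW_SECONDS = 2.0

def recognize(samples) -> str:
    # Stub standing in for the claim-2 acoustic + semantic pipeline.
    return "recognized text"

def show_on_display_unit(text: str) -> None:
    # Stub: a real device would render this on its (transparent) display.
    print(f"[display] {text}")

while True:  # runs until interrupted
    audio = sd.rec(int(SAMPLE_RATE * WINDOW_SECONDS),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until the window has been captured
    show_on_display_unit(recognize(audio[:, 0]))
```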
10. The method of claim 9, wherein processing the voice data to obtain virtual information comprises:
preprocessing the voice data;
recognizing the preprocessed voice data by using an acoustic model to obtain corresponding candidate content;
and performing semantic processing on the candidate content by using a semantic model to obtain the text content.
11. The method of claim 9 or 10, wherein the display device comprises an image acquisition unit, and outputting the virtual information to the display unit for display thereon comprises:
acquiring image data collected by the image acquisition unit;
acquiring a display image by superimposing the virtual information on the image data;
and outputting the display image to the display unit for display.
12. The method of claim 9 or 10, wherein the display unit comprises a transparent display unit, and outputting the virtual information to the display unit for display thereon comprises:
outputting the text content to the transparent display unit for display, so that the text content is overlaid on the scene the user views through the transparent display unit.
13. The method according to claim 9 or 10, wherein the method further comprises:
receiving a target language configured by a user;
and before outputting the virtual information to the display unit:
when the text content in the virtual information does not match the target language, translating the text content in the virtual information into target content corresponding to the target language.
14. A data processing apparatus, comprising:
a first acquisition module configured to acquire voice data in an acquisition area;
a first processing module configured to process the voice data to obtain virtual information, wherein the virtual information comprises text content corresponding to the voice data;
a first output module configured to output the virtual information to at least one display device in a user area for display on the display device, so that a user watching the acquisition area through the display device can view the virtual information within the user's line of sight.
15. A data processing apparatus, wherein the apparatus is located on a display device, the apparatus comprising:
a second acquisition module configured to acquire virtual information, wherein the virtual information comprises text content obtained by recognizing voice data collected in an acquisition area;
a display module configured to display the virtual information, so that a user watching the acquisition area through the display device can see the virtual information within the user's line of sight.
16. A data processing apparatus, wherein the apparatus is located in a display device, the display device comprising a voice acquisition unit and a display unit, the apparatus comprising:
a third acquisition module configured to acquire voice data collected by the voice acquisition unit;
a second processing module configured to process the voice data to obtain virtual information, wherein the virtual information comprises text content corresponding to the voice data;
a second output module configured to output the virtual information to the display unit to display the virtual information on the display unit.
17. An electronic device, comprising a memory and a processor, wherein:
the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method of any one of claims 1-13.
18. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the method of any of claims 1-13.
CN202010627737.1A 2020-07-01 2020-07-01 Data processing method and device, electronic equipment and storage medium Pending CN113889114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010627737.1A CN113889114A (en) 2020-07-01 2020-07-01 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113889114A 2022-01-04

Family

ID=79012490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010627737.1A Pending CN113889114A (en) 2020-07-01 2020-07-01 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113889114A (en)

Similar Documents

Publication Title
KR101995958B1 (en) Apparatus and method for image processing based on smart glass
CN108200446B (en) On-line multimedia interaction system and method of virtual image
US8201080B2 (en) Systems and methods for augmenting audio/visual broadcasts with annotations to assist with perception and interpretation of broadcast content
KR101899588B1 (en) System for automatically generating a sign language animation data, broadcasting system using the same and broadcasting method
CN111654715A (en) Live video processing method and device, electronic equipment and storage medium
JP2019220848A (en) Data processing apparatus, data processing method and program
WO2017141584A1 (en) Information processing apparatus, information processing system, information processing method, and program
CN108475492B (en) Head-mounted display cooperative display system, system including display device and head-mounted display, and display device thereof
CN115668913A (en) Stereoscopic display method, device, medium and system for field performance
CN114339302B (en) Method, device, equipment and computer storage medium for guiding broadcast
JP5346797B2 (en) Sign language video synthesizing device, sign language video synthesizing method, sign language display position setting device, sign language display position setting method, and program
CN111246224A (en) Video live broadcast method and video live broadcast system
KR20110118530A (en) System and device for displaying of video data
JP7385385B2 (en) Image distribution system and image distribution method
KR20120074977A (en) Educational materials and methods using augmented reality the performance of voice recognition and command
CN113889114A (en) Data processing method and device, electronic equipment and storage medium
KR101705988B1 (en) Virtual reality apparatus
CN112764549B (en) Translation method, translation device, translation medium and near-to-eye display equipment
KR102258991B1 (en) Sign-language service providing system
JP2020162083A (en) Content distribution system, content distribution method, and content distribution program
KR101856632B1 (en) Method and apparatus for displaying caption based on location of speaker and apparatus for performing the same
CN111736692B (en) Display method, display device, storage medium and head-mounted device
US20210174823A1 (en) System for and Method of Converting Spoken Words and Audio Cues into Spatially Accurate Caption Text for Augmented Reality Glasses
WO2021226821A1 (en) Systems and methods for detection and display of whiteboard text and/or an active speaker
CN110910508B (en) Image display method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination