CN114401417B - Live stream object tracking method, device, equipment and medium thereof - Google Patents

Live stream object tracking method, device, equipment and medium thereof

Info

Publication number
CN114401417B
CN114401417B (application number CN202210106703.7A)
Authority
CN
China
Prior art keywords
stream
target object
live
information
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210106703.7A
Other languages
Chinese (zh)
Other versions
CN114401417A (en)
Inventor
曾家乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Cubesili Information Technology Co Ltd filed Critical Guangzhou Cubesili Information Technology Co Ltd
Priority to CN202210106703.7A priority Critical patent/CN114401417B/en
Publication of CN114401417A publication Critical patent/CN114401417A/en
Application granted granted Critical
Publication of CN114401417B publication Critical patent/CN114401417B/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/451Execution arrangements for user interfaces
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • H04N21/44218Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting

Abstract

The application relates to the technical field of network live broadcasting, and discloses a live stream object tracking method together with a corresponding device, equipment and medium. The method comprises the following steps: pushing a live stream to a live broadcast room, wherein the live stream comprises a video stream and an audio stream, and the audio stream comprises audio data input by an audio input device; performing voice recognition on the audio data to obtain the corresponding dictation text, and determining the target object to which the dictation text points; identifying the target object in the image stream of the video stream, and acquiring edge contour information of the target object in the video frames of the video stream; and pushing the edge contour information to the live broadcast room as positioning tracking information, so that a terminal device receiving the positioning tracking information highlights the contour of the target object in the playing interface of the video stream. The method and the device improve the readability of live content in the graphical user interface and thereby improve the user experience.

Description

Live stream object tracking method, device, equipment and medium thereof
Technical Field
The present disclosure relates to the field of network live broadcasting technologies, and in particular to a live stream object tracking method and to a corresponding apparatus, computer device, and computer-readable storage medium.
Background
Network live video broadcasting can transmit information quickly and effectively, and is characterized by immediacy, real-time interaction, intuitiveness and entertainment value. One common application of webcasting is commentary on competitive or entertainment programs such as games or sports matches, and providing technical support for commentary-type webcasting has therefore become a new hotspot in the art. Popular live broadcast projects are commonly multi-player competitive projects. When the anchor gives a commentary, the players in the commented project, or objects such as the props and skills those players use, are frequently mentioned, and each time such an object is mentioned the viewer still needs to spend some effort to react and locate the player concerned in order to keep up with the anchor's commentary progress. Because locating the subject of the anchor's narration takes time, the viewing experience of the audience is easily affected.
In order to enable audience users to quickly track the content of the anchor's commentary and to improve the user experience, the present application explores technical schemes suited to these practical requirements.
Disclosure of Invention
It is a primary object of the present application to solve at least one of the above problems and provide a live stream object tracking method and corresponding apparatus, computer device, computer readable storage medium.
In order to meet the purposes of the application, the application adopts the following technical scheme:
the live stream object tracking method provided in accordance with one of the purposes of the present application comprises the following steps:
pushing a live stream to a live broadcasting room, wherein the live stream comprises a video stream and an audio stream, the audio stream comprises audio data input by audio input equipment, and the video stream comprises an image stream corresponding to a display interface of a third-party program;
performing voice recognition on the audio data to obtain a corresponding dictation text, and determining a target object pointed by the dictation text;
identifying the target object from the image stream, and acquiring edge contour information of the target object in a video frame of the video stream;
pushing the edge contour information to the live broadcast room as positioning tracking information, so that terminal equipment receiving the positioning tracking information highlights the contour of the target object in a playing interface of the video stream.
In a further embodiment, pushing the live stream to the live room comprises the steps of:
acquiring an image stream corresponding to a display interface of a third-party program from a video memory;
receiving video data shot by camera equipment connected with a host client device;
Receiving audio data input by an audio input device connected with a host client device;
and synthesizing the image stream and the video data into a video stream, synthesizing the video stream and the audio data into the live stream, and pushing the live stream to a live broadcasting room for playing.
In a further embodiment, performing voice recognition on the audio data to obtain a corresponding dictation text, and determining a target object to which the dictation text points, includes the following steps:
extracting deep acoustic features of the audio data, and constructing corresponding acoustic feature vectors;
calling a first neural network model according to the acoustic feature vector to obtain a corresponding phoneme sequence, and decoding the phoneme sequence to obtain the dictation text;
and matching the dictation text according to object text information in a preset information list, and obtaining object text information matched with the dictation text so as to confirm the target object.
In an extended embodiment, before the step of matching the dictation text against the object text information in the preset information list to obtain the object text information matched with the dictation text and so confirm the target object, the method comprises the following steps:
acquiring entries corresponding to descriptions of competitive items, wherein the competitive items comprise game items or sports competition items;
screening out object text information corresponding to the character names and the character skill names of the target objects participating in the competitive item;
and storing the object text information associated with the corresponding target object in the information list.
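Purely as a non-limiting illustration, the list-building steps above might take the following shape in code; the entry layout and the example names are assumptions made for illustration and are not prescribed by the present application.

```python
# A minimal sketch, with made-up example data, of the list-building steps above: entries
# describing a competitive item are screened for character names and character skill names,
# and each piece of object text is stored against the unique code of its target object.
def build_information_list(entries: list[dict]) -> dict[str, str]:
    """Map object text (names, skill names) to the unique identification code of its object.

    Each entry is assumed to look like:
    {"object_id": "player_0010", "name": "Mei Xi", "skills": ["Nutmeg", "Chip Shot"]}
    """
    info_list: dict[str, str] = {}
    for entry in entries:
        object_id = entry["object_id"]
        info_list[entry["name"]] = object_id          # character name
        for skill_name in entry.get("skills", []):    # character skill names
            info_list[skill_name] = object_id
    return info_list
```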
In a further embodiment, identifying the target object from the image stream, and acquiring edge contour information of the target object in a video frame of the video stream includes the following steps:
extracting deep picture features of each video frame in the video stream, and constructing corresponding picture feature vectors;
invoking a second neural network model according to the picture feature vector to identify the target object in the video frame of the video stream, and obtaining the real-time position of the target object in the video frame of the video stream;
and calling a third neural network model to segment out a picture feature vector corresponding to the target object, and performing edge compensation calculation on the picture feature vector to obtain edge contour information of the target object, wherein the edge contour information comprises an edge contour corresponding to the target object and a corresponding real-time position of the edge contour.
In a further embodiment, the edge contour information is pushed to the live broadcast room as positioning tracking information, so that a terminal device receiving the positioning tracking information highlights the contour of the target object in a playing interface of the video stream, and this includes the following steps:
associating the edge contour information corresponding to the target object with the time stamp of the corresponding video frame of the video stream to form the positioning tracking information, uploading the positioning tracking information to a server and pushing it to the live broadcast room, so that the server sends the positioning tracking information to the terminal equipment connected to the live broadcast room;
and detecting the on state of a display-tracking-object switch of the terminal equipment, and, if the state is detected to be on, rendering an edge contour color for the target object according to the positioning tracking information, so that the contour of the target object is highlighted in the playing interface of the video stream.
In a preferred embodiment, rendering edge contour colors for the target object according to the positioning tracking information, so as to highlight the contour of the target object in the playing interface of the video stream, including the following steps:
positioning the edge contour in the edge contour information in the video frame of the video stream according to the real-time position in the edge contour information in the positioning and tracking information, and extracting the peripheral color of the edge contour in the video frame;
confirming the corresponding color gamut according to the color value with the highest proportion among the peripheral colors of the edge contour, and acquiring a color value different from that color gamut, which is set as the edge contour color of the target object;
And rendering the edge contour of the video frame by adopting the edge contour color of the target object so as to display the edge contour of the target object in a playing interface of the video stream.
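As a non-limiting sketch of the contrast-color selection just described, the following code samples the pixels around the contour and returns an opposite hue; the RGB frame layout, the 12-bin hue histogram and the fixed saturation and brightness of the output color are illustrative assumptions, not details fixed by the present application.

```python
# A minimal sketch of the contrast-color selection above: sample the pixels around the
# edge contour, find the dominant hue (the color value with the highest proportion) and
# return a hue far away from it as the edge contour color of the target object.
import colorsys
import numpy as np

def pick_contour_color(frame: np.ndarray, contour: np.ndarray, radius: int = 3) -> tuple:
    """frame: H x W x 3 uint8 RGB video frame; contour: N x 2 array of (x, y) points."""
    height, width, _ = frame.shape
    samples = []
    for x, y in contour:
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        samples.append(frame[y0:y1, x0:x1].reshape(-1, 3))
    pixels = np.concatenate(samples).astype(np.float32) / 255.0

    # Dominant hue around the contour.
    hues = np.array([colorsys.rgb_to_hsv(*p)[0] for p in pixels])
    hist, edges = np.histogram(hues, bins=12, range=(0.0, 1.0))
    dominant_hue = edges[np.argmax(hist)] + 1.0 / 24.0

    # Take the opposite hue so the outline stands out from its surroundings.
    r, g, b = colorsys.hsv_to_rgb((dominant_hue + 0.5) % 1.0, 1.0, 1.0)
    return int(r * 255), int(g * 255), int(b * 255)
```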
A live stream object tracking apparatus provided in accordance with one of the objects of the present application, comprising: a live stream pushing module, a voice translation module, an image recognition module and a contour display module, wherein the live stream pushing module is used for pushing a live stream to a live broadcast room, the live stream comprises a video stream and an audio stream, the audio stream comprises audio data input by audio input equipment, and the video stream comprises an image stream corresponding to a display interface of a third-party program; the voice translation module is used for performing voice recognition on the audio data to obtain a corresponding dictation text, and determining a target object to which the dictation text points; the image recognition module is used for identifying the target object from the image stream and acquiring edge contour information of the target object in a video frame of the video stream; and the contour display module is used for pushing the edge contour information to the live broadcast room as positioning tracking information, so that the terminal equipment receiving the positioning tracking information highlights the contour of the target object in the playing interface of the video stream.
In a further embodiment, the live stream pushing module includes: the image stream acquisition sub-module is used for acquiring the image stream corresponding to the display interface of the third-party program from the video memory; the video data receiving sub-module is used for receiving video data shot by the camera equipment connected with the anchor client equipment; the audio data receiving sub-module is used for receiving audio data input by audio input equipment connected with the anchor client equipment; and the video stream synthesis sub-module is used for synthesizing the image stream and the video data into a video stream, synthesizing the video stream and the audio data into the live stream, and pushing the live stream to a live broadcasting room for playing.
In a further embodiment, the voice translation module includes: the feature extraction sub-module, used for extracting deep acoustic features of the audio data and constructing corresponding acoustic feature vectors; the decoding sub-module, used for calling the first neural network model according to the acoustic feature vectors to obtain a corresponding phoneme sequence, and decoding the phoneme sequence to obtain the dictation text; and the target object confirmation sub-module, used for matching the dictation text against the object text information in the preset information list and obtaining the object text information matched with the dictation text, so as to confirm the target object.
In an extended embodiment, the apparatus further includes, operating before the target object confirmation sub-module: the entry acquisition unit, used for acquiring entries corresponding to the descriptions of competitive items, wherein the competitive items include game items or sports competition items; the text information screening unit, used for screening out object text information corresponding to the character names and character skill names of the target objects participating in the competitive item; and the storage unit, used for storing the object text information associated with the corresponding target object in the information list.
In a further embodiment, the image recognition module includes: the feature extraction sub-module, used for extracting deep picture features of each video frame in the video stream and constructing corresponding picture feature vectors; the target object identification sub-module, used for calling a second neural network model according to the picture feature vectors to identify the target object in the video frames of the video stream and to obtain the real-time position of the target object in the video frames of the video stream; and the edge contour information sub-module, used for calling a third neural network model to segment out the picture feature vector corresponding to the target object and performing edge compensation calculation on it to obtain the edge contour information of the target object, wherein the edge contour information comprises the edge contour corresponding to the target object and the corresponding real-time position of that edge contour.
In a further embodiment, the contour display module includes: the positioning tracking information sub-module, used for associating the edge contour information corresponding to the target object with the time stamp of the corresponding video frame in the video stream to form the positioning tracking information, and for uploading the positioning tracking information to a server and pushing it to the live broadcast room so that the server sends the positioning tracking information to the terminal equipment connected to the live broadcast room; and the edge contour rendering sub-module, used for detecting the on state of the display-tracking-object switch of the terminal equipment and, if the state is detected to be on, rendering an edge contour color for the target object according to the positioning tracking information, so that the contour of the target object is highlighted in the playing interface of the video stream.
In a preferred embodiment, the edge contour rendering sub-module includes: the color acquisition unit is used for positioning the edge contour in the edge contour information in the video frame of the video stream according to the real-time position in the edge contour information in the positioning and tracking information and extracting the peripheral color of the edge contour in the video frame; a color confirmation unit, configured to confirm a color gamut corresponding to the edge contour according to a color value with the highest ratio in the peripheral colors of the edge contour, and acquire a color value different from the color gamut as an edge contour color of the target object; and the color rendering unit is used for rendering the edge contour of the video frame by adopting the edge contour color of the target object so as to display the edge contour of the target object in a playing interface of the video stream.
A computer device provided in accordance with one of the objects of the present application comprises a central processor and a memory, the central processor being adapted to invoke the steps of running a computer program stored in the memory to perform the live stream object tracking method described herein.
A computer readable storage medium adapted for the purposes of the present application stores in the form of computer readable instructions a computer program implemented according to the live stream object tracking method, which when invoked by a computer, performs the steps comprised by the method.
Compared with the prior art, the method has the following advantages:
according to the present application, the server pushes to the live broadcast room a live stream synthesized from the image stream and the audio stream generated on the anchor side; after the live stream reaches the live broadcast room, voice recognition is performed on its audio data to obtain the corresponding dictation text, so that the target object being referred to is determined; the target object is then identified in the image stream, and its edge contour information in the video frames of the video stream, together with the time stamps of the corresponding video frames, is assembled into positioning tracking information; the positioning tracking information is uploaded to the server, which pushes it to the live broadcast room, and the terminal equipment connected to the live broadcast room receives the positioning tracking information and renders the contour of the target object according to it so that the contour is highlighted in the video playing interface. No manual participation is needed in the whole process: the target object can be identified accurately and quickly from the anchor user's spoken commentary and its contour marked, and the viewer can clearly and quickly locate the target object from the marked contour, which improves the readability of the live content in the graphical user interface and thereby improves the user experience.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a typical network deployment architecture relevant to implementing the technical solutions of the present application;
FIG. 2 is a flow chart of an exemplary embodiment of a live stream object tracking method of the present application;
fig. 3 (a) and fig. 3 (b) are schematic diagrams of a graphical user interface of a terminal device in an embodiment of the present application, which are respectively a user permission popup interface and a tracking frame selection interface for displaying a live stream object effect;
fig. 4 is a schematic flow chart of live stream composition pushing in the embodiment of the present application;
FIG. 5 is a flow chart of speech recognition in an embodiment of the present application;
fig. 6 is a schematic flow chart of information list construction in the embodiment of the application;
fig. 7 is a schematic flow chart of acquiring edge contour information of a target object in an embodiment of the present application;
FIG. 8 is a flowchart illustrating a process of displaying an edge contour of a target object on a graphical user interface according to an embodiment of the present application;
FIG. 9 is a flowchart illustrating a process of determining edge contour color of a target object according to an embodiment of the present application;
FIG. 10 is a functional block diagram of a live stream object tracking method of the present application;
Fig. 11 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, "client", "terminal" and "terminal device" are understood by those skilled in the art to include both devices that include only wireless signal receivers without transmitting capability and devices that include receiving and transmitting hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device, with or without a multi-line display, such as a personal computer or tablet; a PCS (Personal Communications Service) device that may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant) that can include a radio frequency receiver, pager, internet/intranet access, web browser, notepad, calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other appliance that has and/or includes a radio frequency receiver. As used herein, "client", "terminal" and "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime and/or land-based), or adapted and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. As used herein, "client", "terminal" or "terminal device" may also be a communication terminal, an internet terminal, or a music/video playing terminal, for example a PDA, a MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, or may also be a device such as a smart TV or a set-top box.
The hardware referred to by the names "server", "client", "service node" and the like in the present application is essentially an electronic device having the performance of a personal computer: a hardware device having the necessary components described by the von Neumann principle, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device and an output device. A computer program is stored in the memory, and the central processing unit calls the program stored in the external memory to run in the memory, executes the instructions in the program and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application is equally applicable to the case of a server farm. The servers should be logically partitioned, physically separate from each other but interface-callable, or integrated into a physical computer or group of computers, according to network deployment principles understood by those skilled in the art. Those skilled in the art will appreciate this variation and should not be construed as limiting the implementation of the network deployment approach of the present application.
Referring to fig. 1, the hardware base required for implementing the related technical solution of the present application may be deployed according to the architecture shown in the figure. The server 80 is deployed at the cloud as a service server, and may be responsible for further connecting to related data servers and other servers providing related support, so as to form a logically related service cluster, to provide services for related terminal devices, such as a smart phone 81 and a personal computer 82 shown in the figure, or a third party server (not shown). The smart phone and the personal computer can access the internet through a well-known network access mode, and establish a data communication link with the cloud server 80 so as to run a terminal application program related to the service provided by the server.
For the server, the application program is generally constructed as a service process, and a corresponding program interface is opened for remote call of the application program running on various terminal devices.
An application program here means a program running on a server or a terminal device that implements the relevant technical scheme of the present application by means of programming. Its program code can be stored, in the form of computer-executable instructions, in a non-volatile storage medium recognizable by a computer, and is called by the central processing unit to run in memory; it is through the running of the application program on the computer that the relevant apparatus of the present application is constructed.
Unless expressly specified otherwise, one or several technical features of the present application may be deployed on a server, with the client accessing them by remotely invoking the online service interface provided by the server, or may be deployed and run directly on the client.
Unless expressly specified otherwise, the various data referred to in the present application may be stored either remotely on a server or on a local terminal device, as long as the data is suitable for being invoked by the technical solution of the present application.
Those skilled in the art will appreciate that, although the various methods of the present application are described based on the same concepts so as to be common to each other, the methods may be performed independently unless otherwise indicated. Similarly, the embodiments disclosed herein are based on the same inventive concept, so identical descriptions, as well as descriptions that differ only in being conveniently and appropriately adapted, should be understood in the same way.
Unless expressly indicated to be mutually exclusive, the technical features involved in the various embodiments disclosed herein may be cross-combined to flexibly construct new embodiments, so long as such combination does not depart from the inventive spirit of the present application and can satisfy needs in the art or remedy deficiencies in the prior art. Those skilled in the art will appreciate such variants.
Referring to fig. 2, in an exemplary embodiment, the live stream object tracking method of the present application includes the following steps:
step S1100, pushing a live stream to a live broadcasting room, wherein the live stream comprises a video stream and an audio stream, the audio stream comprises audio data input by audio input equipment, and the video stream comprises an image stream corresponding to a display interface of a third-party program;
The live stream comprises a video stream and an audio stream. The audio stream comprises audio data input by an audio input device: the application program developed by the network live broadcast platform and installed by the anchor user on the terminal device serving as the client obtains the corresponding permission for the microphone of that terminal device, that is, for the audio input device of the anchor client device, and then receives the audio data captured by that microphone. The video stream comprises an image stream corresponding to the display interface of a third-party program: when the anchor user logs in to and runs, on the terminal device serving as the client, an application program developed and maintained by a third-party platform, the third-party platform server provides the corresponding service, the corresponding CPU component of the terminal device is called for rendering, and the resulting image stream is finally displayed in the graphical user interface of the terminal device; such third-party programs include multi-player competitive games and sports-channel players.
In one embodiment, the audio stream further comprises the audio produced while the corresponding third-party program in the image stream is running. The video stream further comprises video data input by a video input device of, or connected to, the anchor client device: when the anchor user logs in to and runs, on the terminal device serving as the client, the live broadcast application program developed and maintained by the live broadcast platform, that application program obtains from the anchor user the permission to call the camera function of the terminal device and to use its camera data, so that the camera built into or externally connected to the terminal device captures the anchor user's picture in real time.
The video stream and the audio stream are correspondingly encoded and compressed and then synthesized into the live stream. In response to the live broadcast request triggered when a client of the network live broadcast platform loads the live broadcast room opened by the anchor user, the server pushes the live stream to every client in that live broadcast room in real time, so that each client receives the live stream, performs the corresponding decoding operations to obtain the video stream and audio stream data, and calls the player to play the audio and video in the live broadcast room: for example, the image stream is loaded and played in region 300b of the graphical user interface of the live broadcast room, and the camera video data is loaded and played in region 302b of that interface.
Step S1200, performing voice recognition on the audio data to obtain a corresponding dictation text, and determining a target object pointed by the dictation text;
the speech recognition is automatic speech recognition Automatic Speech Recognition, (ASR) with the aim of converting the lexical content in human speech into computer readable inputs, such as keys, binary codes or character sequences, which are further converted by the computer into human readable text output, typically for application purposes.
Methods suitable for this speech recognition include, but are not limited to, template-matching methods and methods using artificial neural networks, both exemplified below. The template-matching approach is mainly divided into four steps, and its common technical means include three technologies: Dynamic Time Warping (DTW), Hidden Markov Model (HMM) theory and Vector Quantization (VQ). Methods using artificial neural networks may be based on a neural network alone or on hybrid algorithms, such as the ANN/HMM, FSVQ/HMM, GMM/HMM and DNN/HMM methods, where FSVQ denotes the finite-state vector quantization algorithm.
In an embodiment, the audio data is first preprocessed so that features can be extracted more effectively: the silence at the beginning and end is cut off to eliminate interference with the subsequent steps, the recommended reference implementation being VAD (voice endpoint detection). The sound is then divided into small, overlapping segments called frames, the recommended reference implementation being a moving window function. Next, the waveform of each frame is converted into a multi-dimensional feature vector containing the sound information, the recommended reference implementations being the Linear Prediction Cepstral Coefficient (LPCC) and Mel-Frequency Cepstral Coefficient (MFCC) algorithms. The feature vectors are then decoded in three stages, corresponding to an acoustic model, a dictionary and a language model: the feature vectors are input into an Acoustic Model (AM) pre-trained to convergence to obtain the phoneme information it outputs, the recommended references being GMM+HMM and DNN+HMM; the dictionary maps the phoneme information to candidate characters or words; and the language model scores the candidate word sequences so that, after matching, the word sequence with the best score is output as the dictation text. The specific implementations of the acoustic model, the dictionary and the language model can be selected and implemented flexibly by a person skilled in the art according to requirements.
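The front end just described can be sketched, purely for illustration, as follows; the librosa-based silence trimming and MFCC extraction, and the stubbed acoustic model, dictionary and language model objects, are assumptions rather than a prescribed implementation.

```python
# A minimal sketch of the recognition front end described above: trim leading/trailing
# silence, cut the signal into overlapping frames and turn each frame into an MFCC
# feature vector; the model stages are left as stubs because the text allows several
# choices (GMM+HMM, DNN+HMM, and so on).
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    audio, _ = librosa.load(wav_path, sr=sr)
    audio, _ = librosa.effects.trim(audio, top_db=30)          # rough stand-in for VAD
    # 25 ms frames with a 10 ms hop give the overlapping frames the text requires.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.T                                              # one feature vector per frame

def recognize(wav_path: str, acoustic_model, lexicon, language_model) -> str:
    """Hypothetical glue code: features -> phonemes -> candidate words -> best sentence."""
    features = extract_features(wav_path)
    phoneme_posteriors = acoustic_model(features)       # e.g. a DNN+HMM acoustic model
    candidate_words = lexicon.map_phonemes(phoneme_posteriors)
    return language_model.best_sentence(candidate_words)
```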
In summary, after the dictation text is obtained, it is matched against the target object texts in a preset information list, and the target object is then confirmed according to the mapping relation between the matched target object text and the unique identification code of that target object; the information list stores the unique identification codes of all target objects and their corresponding target object text information. By way of example, if the dictation text is an utterance referring to "Mei Xi" (Messi), it is matched against the target object texts in the preset information list, the target object text "Mei Xi" in the information list is hit, and the unique identification code corresponding to "Mei Xi" then confirms "Mei Xi" as the target object.
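One possible, non-prescriptive shape for this look-up is sketched below; the mapping layout, the example entries and the difflib-based fuzzy fall-back are illustrative assumptions, since no particular matching algorithm is mandated here.

```python
# A minimal sketch of the information-list look-up above, assuming the list maps object
# text (names and aliases) to unique identification codes; exact substring matching is
# tried first, with approximate matching as a fall-back for recognition errors.
import difflib

INFO_LIST = {
    "Mei Xi": "player_0010",      # object text -> unique identification code (example data)
    "Messi": "player_0010",
    "Ronaldo": "player_0007",
}

def resolve_target_object(dictation_text: str) -> str | None:
    """Return the unique identification code of the object the dictation text points to."""
    for object_text, object_id in INFO_LIST.items():
        if object_text.lower() in dictation_text.lower():
            return object_id
    close = difflib.get_close_matches(dictation_text, INFO_LIST.keys(), n=1, cutoff=0.6)
    return INFO_LIST[close[0]] if close else None
```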
Step S1300, identifying the target object from the image stream, and obtaining edge contour information of the target object in a video frame of the video stream;
and invoking an object detection neural network model pre-trained to be converged according to the unique identification code of the target object, and detecting the target object for each frame of picture of the image stream, so that the target object can be identified once each picture appears, and further tracking and identification of the target object can be realized for the image stream.
In one embodiment, the convolutional layers of the object detection neural network model pre-trained to convergence structure each frame of the image stream into category information, namely the category of every object in the picture; a Softmax classifier then classifies this information, identifying each object in the picture with a predefined category (a string) and an instance ID so as to distinguish the objects from one another. Through multi-scale detection by the object detection neural network model, every object is detected, the target object is identified, and its position in the picture is confirmed, the position generally being expressed as the coordinates of a rectangular detection box. An instance segmentation neural network model, likewise pre-trained to convergence, is then invoked to obtain the edge contour corresponding to the target object and its position information in the picture. In this way every video frame of the image stream is analyzed and the edge contour information corresponding to the target object in every video frame of the video stream is obtained, the edge contour information containing the edge contour data of the target object and the position of that contour in each frame of the image. The object detection neural network model can be a Faster-RCNN, YOLO, SSD, R-FCN or EfficientDet detection model, among others, and the instance segmentation model can be a Mask-RCNN, YOLACT, Deepmask or box-based instance segmentation model, among others; the specific model can be chosen flexibly by a person skilled in the art according to the actual service requirements.
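For illustration only, a per-frame detection and segmentation pass of this kind might be sketched as follows, using the Mask-RCNN model named above as one possible choice; the torchvision weights, the score threshold and the way the target object is tied to a detection label are assumptions made for illustration.

```python
# A minimal sketch of the per-frame identification above: an instance segmentation model
# produces a mask for the target object, and OpenCV turns that mask into the edge contour
# plus the rectangular detection box (the real-time position).
import cv2
import numpy as np
import torch
import torchvision

# Assumes torchvision >= 0.13; "DEFAULT" selects the pre-trained COCO weights.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def edge_contour_info(frame_rgb: np.ndarray, target_label: int, score_thresh: float = 0.7):
    """Return (contour points, bounding box) of the target object in one video frame."""
    tensor = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        detections = model([tensor])[0]
    for label, score, box, mask in zip(detections["labels"], detections["scores"],
                                       detections["boxes"], detections["masks"]):
        if int(label) != target_label or float(score) < score_thresh:
            continue
        binary = (mask[0].numpy() > 0.5).astype(np.uint8)
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        largest = max(contours, key=cv2.contourArea)
        return largest.reshape(-1, 2), box.numpy()   # edge contour + rectangular position
    return None
```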
Step S1400, pushing the edge contour information to the live broadcast room as positioning tracking information, so that terminal equipment receiving the positioning tracking information highlights the contour of the target object in the playing interface of the video stream.
The server pushes the edge contour information, as the positioning tracking information corresponding to the target object dictated by the anchor user in the current live broadcast room, to the terminal device of the client used by each audience member in the live broadcast room. In one embodiment, the terminal device receiving the positioning tracking information turns on its front camera and calls the corresponding function of a preset face recognition interface to recognize whether a face is present in front of the screen of the terminal device; a person skilled in the art may generally call such a preset face recognition interface directly, call a development interface provided by the device system developer, or develop or call a third-party interface according to actual service needs. If the recognition result is that a face is present in front of the screen, a color is selected at random as the edge contour color according to the edge contour data of the target object in the positioning tracking information, and contour color rendering is then performed for the target object in each video frame according to the position data of the edge contour in the positioning tracking information, so that while the video stream is played the target object is tracked and highlighted; if there are a plurality of target objects, the contours of all of them are displayed in the image display area 300b of the playing interface.
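A sketch of the client-side drawing step, under an assumed message layout for the positioning tracking information, is given below; the field names and the default color are illustrative only.

```python
# A minimal sketch of the client-side rendering above: look up the contour whose time
# stamp matches the decoded video frame and draw it over the playback picture.
import cv2
import numpy as np

def render_tracking_overlay(frame_bgr: np.ndarray, tracking_info: dict,
                            frame_timestamp_ms: int) -> np.ndarray:
    """Highlight the target object's outline on one decoded video frame."""
    if tracking_info.get("timestamp_ms") != frame_timestamp_ms:
        return frame_bgr                                   # no contour for this frame
    contour = np.asarray(tracking_info["edge_contour"], dtype=np.int32).reshape(-1, 1, 2)
    color = tuple(tracking_info.get("contour_color", (0, 255, 0)))   # BGR
    cv2.polylines(frame_bgr, [contour], isClosed=True, color=color, thickness=3)
    return frame_bgr
```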
In addition, when it is detected that new audio data has been input through the audio input device of the anchor client device in the terminal device of the anchor user, the highlighting of the current target object is stopped, and steps S1100-S1400 are performed so that, according to the new audio data, the outline of the corresponding target object is highlighted in the image stream presentation area 200 of the graphical user interface of the live broadcast room.
Furthermore, timing starts after step S1100 has been executed. If it is detected that no new audio data has been input through the audio input device of the anchor client device in the terminal device of the anchor user, steps S1100-S1400 continue to be executed to maintain the highlighting of the current target object until a preset time period is reached, after which the highlighting of the current target object is stopped. At the same time, a corresponding prompt popup is displayed in the graphical user interface of the live broadcast room on the current anchor user's client, informing the anchor user that the highlighted state of the current target object has timed out; the prompt popup closes automatically after being displayed for 3 seconds, or closes immediately when the anchor user clicks an area outside the prompt popup in the current graphical user interface of the live broadcast room. The preset time period can be set flexibly by a person skilled in the art according to the actual operating effect.
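For illustration only, this timeout behavior could be tracked with a small helper such as the following; the class, its method names and the 30-second default duration are assumptions, not values fixed by the present application.

```python
# A minimal sketch of the highlight timeout above: the highlight of the current target
# object is kept alive while no new audio arrives and expires once the preset period passes.
import time

class HighlightTimer:
    def __init__(self, timeout_seconds: float = 30.0):
        self.timeout_seconds = timeout_seconds
        self.started_at: float | None = None

    def start(self) -> None:
        """Called once step S1100 has been executed for the current target object."""
        self.started_at = time.monotonic()

    def on_new_audio(self) -> None:
        """New audio from the audio input device restarts tracking (steps S1100-S1400)."""
        self.started_at = time.monotonic()

    def highlight_expired(self) -> bool:
        """True once the preset time period has been reached, so highlighting should stop."""
        if self.started_at is None:
            return True
        return time.monotonic() - self.started_at >= self.timeout_seconds
```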
In the present application, the server pushes to the live broadcast room the live stream synthesized from the image stream and the audio stream generated on the anchor side; after users connected to the live broadcast room receive the live stream, voice recognition is performed on its audio data to obtain the corresponding dictation text, and the target object to which that text points is thereby determined.
Referring to fig. 4, in a further embodiment, the step S1100 of pushing the live stream to the live room includes the following steps:
Step S1110, obtaining an image stream corresponding to a display interface of a third party program from a video memory;
The client of the anchor user invokes a preset method function to acquire, in real time from the video memory module of the anchor user's terminal device, the image stream that the running third-party program displays on the display interface of that terminal device; the preset method function may be a packaged computer instruction, an interface function provided by the device developer, or the like, and can be set flexibly by a person skilled in the art.
Step S1120, receiving video data captured by camera equipment connected to the anchor client device;
The camera equipment can be the built-in camera module of the anchor client device, or a device such as a mobile phone, tablet, camera, video recorder or computer connected to the anchor client device via Bluetooth or an electrical connection.
The client of the anchor user invokes the camera equipment connected to the anchor client device, and the camera equipment may transmit the captured video data to the anchor client device through a related transmission protocol, such as the Bluetooth protocol, the HTTP protocol or the USB protocol, so that the client obtains the video data.
Step S1130, receiving audio data input by an audio input device connected to the anchor client device;
The audio input device can be the built-in sound card module of the anchor client device, or a device such as a microphone or earphone connected to the anchor client device via Bluetooth or an electrical connection.
The client of the anchor user invokes the audio input device connected to the anchor client device, and the audio input device may transmit the input audio data to the anchor client device through a related transmission protocol, such as the Bluetooth protocol, the HTTP protocol or the USB protocol, so that the client obtains the audio data.
Step S1140, synthesizing the image stream and the video data into a video stream, synthesizing the video stream and the audio data into the live stream, and pushing the live stream to a live broadcasting room for playing.
In an embodiment, the client of the anchor user uploads the image stream, the video data and the audio data to the server of the corresponding network live broadcast platform. The server encodes the image stream, the video data and the audio data respectively, to facilitate subsequent storage or data transmission, marks the encoded image stream and video data with the identifiers of the play areas of the live broadcast room playing interface in which each is to be played, synthesizes them into a video stream, and then synthesizes the video stream and the encoded audio data into the live stream.
In another embodiment, the client of the anchor user encodes the image stream, the video data and the audio data locally, marks the encoded image stream and video data with the identifiers of the play areas of the live broadcast room playing interface in which each is to be played, synthesizes them into a video stream, synthesizes the video stream and the encoded audio data into the live stream, and uploads the live stream to the server.
The server distributes the live stream to the clients of the audience users in the live broadcast room; after receiving the live stream, each client decodes it correspondingly and passes it to the player to be played on the playing interface of the live broadcast room.
In this embodiment, the audio and video data pushed from the anchor user side are encoded and synthesized into a single stream, which saves the bandwidth required for data transmission, greatly improves data transmission efficiency, and is better suited to service scenarios with high-frequency data transmission.
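Purely as an illustrative sketch, the composition and push of steps S1110-S1140 could be driven through ffmpeg as follows; the input names, overlay position, codecs and RTMP ingest URL are placeholders, since no particular encoder or transport is prescribed here.

```python
# A minimal sketch of the composition and push in steps S1110-S1140: overlay the camera
# picture on the third-party program picture, add the microphone audio, encode once and
# push the single resulting stream to the live room's ingest address.
import subprocess

def push_live_stream(screen_src: str, camera_src: str, mic_src: str, rtmp_url: str) -> None:
    cmd = [
        "ffmpeg",
        "-i", screen_src,        # image stream of the third-party program interface
        "-i", camera_src,        # video data from the camera equipment
        "-i", mic_src,           # audio data from the audio input device
        # Picture-in-picture: camera picture in the lower-right corner of the program picture.
        "-filter_complex", "[0:v][1:v]overlay=W-w-10:H-h-10[v]",
        "-map", "[v]", "-map", "2:a",
        "-c:v", "libx264", "-c:a", "aac",
        "-f", "flv", rtmp_url,   # e.g. the live broadcast room's RTMP ingest address
    ]
    subprocess.run(cmd, check=True)
```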
Referring to fig. 5, in a further embodiment, the step S1200 of performing voice recognition on the audio data to obtain a corresponding dictation text and determining a target object to which the dictation text points includes the following steps:
step S1210, extracting deep acoustic features of the audio data, and constructing corresponding acoustic feature vectors;
The audio data is preprocessed: noise, channel distortion and the like are removed so as to relatively enhance the speech. The speech is then framed by segmenting it into frames whose length is larger than the frame shift, so that adjacent frames overlap to a certain extent, and the speech signal is converted from the time domain to the frequency domain. Deep acoustic feature extraction is then performed on each frame using linear predictive coding or Mel-frequency cepstral coefficients, so as to construct the acoustic feature vector corresponding to each frame of the speech waveform.
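For illustration only, a minimal sketch of this preprocessing and feature step, assuming librosa as the signal-processing backend; the 25 ms frame length and 10 ms frame shift are example values satisfying the frame-length-greater-than-frame-shift condition, and the audio file name is hypothetical.

```python
import librosa

# Minimal sketch: load the anchor's audio, apply a simple pre-emphasis
# enhancement, and extract per-frame MFCC acoustic feature vectors with
# overlapping frames (frame length > frame shift).
y, sr = librosa.load("anchor_audio.wav", sr=16000)   # hypothetical audio file
y = librosa.effects.preemphasis(y)                   # simple enhancement step

frame_length = int(0.025 * sr)   # 25 ms -> 400 samples
hop_length = int(0.010 * sr)     # 10 ms -> 160 samples

mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=frame_length, hop_length=hop_length,
)
acoustic_vectors = mfcc.T        # one 13-dimensional feature vector per frame
```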
Step S1220, calling a first neural network model according to the acoustic feature vector to obtain a corresponding phoneme sequence, and decoding the phoneme sequence to obtain the dictation text;
the first neural network model is an acoustic model, and can be selected from an LSTM+CTC model, a context-dependent deep neural network-hidden Markov model (CD-DNN-HMM), a Gaussian mixture model-hidden Markov model (GMM-HMM) and the like.
In one embodiment, the LSTM+CTC model is called with the acoustic feature vector as input and outputs a corresponding variable-length feature sequence, namely a phoneme sequence, according to a preset phoneme set. For English, the phoneme set is a set of 39 phonemes mainly composed of the 26 English characters and the space character; for Chinese, it is composed of the initials and finals of Chinese pinyin; a person skilled in the art can flexibly set the corresponding phonemes according to actual service requirements. A language model pre-trained to convergence is then called on the phoneme sequence to obtain a plurality of hypothesized word sequences; the language model may be an N-Gram language model or an RNN-based language model. Further, a decoder is called to compute the acoustic model score and the language model score corresponding to the phoneme sequence and the plurality of hypothesized word sequences; the Viterbi algorithm is recommended for searching the optimal path, and the sequence with the highest total score is output as the dictation text. The decoder can be flexibly selected by a person skilled in the art.
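For illustration only, a minimal sketch of turning per-frame acoustic-model outputs into a phoneme sequence with greedy CTC decoding; a full decoder would combine acoustic and language model scores, e.g. with a Viterbi or beam search, as described above. The phoneme set below is a stand-in for the preset phoneme set.

```python
import numpy as np

# Minimal sketch of CTC decoding: take the best phoneme per frame, collapse
# consecutive repeats, and drop the blank symbol. This greedy version only
# illustrates the idea; it is not the scored Viterbi search described above.
PHONEMES = ["<blank>", "a", "b", "c"]   # stand-in for the preset phoneme set

def greedy_ctc_decode(frame_probs: np.ndarray) -> list:
    """frame_probs: (num_frames, num_phonemes) posteriors from the acoustic model."""
    best = frame_probs.argmax(axis=1)
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:     # collapse repeats, skip blank (index 0)
            decoded.append(PHONEMES[idx])
        prev = idx
    return decoded
```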
Step S1230, matching the dictation text according to the object text information in the preset information list, and obtaining the object text information matched with the dictation text, so as to confirm the target object.
The dictation text is matched against the target object texts in the preset information list, and the target object is then confirmed according to the mapping relationship between the matched target object text and the unique identification code of the target object; the information list stores the unique identification code of each target object together with the corresponding object text information.
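For illustration only, a minimal sketch of this matching step: traversing a hypothetical information list and returning the unique identification code of the first target object whose text appears in the dictation text. The list contents shown are examples, not data from the method above.

```python
# Hypothetical information list: each entry maps a unique identification code
# to the object text information associated with the target object.
info_list = [
    {"object_id": "hero_001", "object_text": ["Hero A", "Skill X"]},
    {"object_id": "hero_002", "object_text": ["Hero B", "Skill Y"]},
]

def match_target_object(spoken_text: str):
    """Return the unique identification code of the matched target object, or None."""
    for entry in info_list:
        if any(term.lower() in spoken_text.lower() for term in entry["object_text"]):
            return entry["object_id"]
    return None

print(match_target_object("hero a just used skill x"))  # -> "hero_001"
```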
In this embodiment, the spoken voice of the anchor user is converted into the corresponding text, and the text information of the target object is screened out to determine the target object.
Referring to fig. 6, in an extended embodiment, before the step S1230 of matching the spoken text according to the object text information in the preset information list, obtaining the object text information matched with the spoken text, and thereby confirming the target object, the method includes the following steps:
Step S1231, obtaining entries corresponding to descriptions of competitive items, wherein the competitive items comprise game items or sports competition items;
In one embodiment, for a game item, a GET request is sent through an interface provided by the third-party network platform that develops and maintains the game item, so as to obtain entries corresponding to various descriptions in the game item, such as Chinese/English entries for characters, character skills, maps, equipment and the like. For a sports competition item, the entries corresponding to various descriptions in the item, such as entries for players' basic personal information and team member numbers, are obtained by searching document data provided by the competition host's network, or by using a search engine to retrieve Baidu Baike entries, Wikipedia entries and the like.
Step S1232, screening out object text information corresponding to the character names of the target objects participating in the athletic project and the character skill names;
For the game item, the entries acquired in step S1231 are deduplicated and classified, and the entries corresponding to the character names and character skill names of the target objects are then screened out from them to obtain the corresponding object text information. For the sports competition item, the entries collected in step S1231 are likewise deduplicated and classified, and the entries corresponding to the contestant names and contestant numbers of the target objects are screened out to obtain the corresponding object text information. The target objects can be flexibly set by a person skilled in the art according to actual business requirements, and the object text information corresponding to the object attributes required by the actual business is screened out according to the corresponding competitive item.
Step S1233, storing the object text information associated with the corresponding target object in the information list.
A data set whose data structure is an array is created as the information list, a unique identification code is created for each target object, and the associated object text information is stored into the information list, so that the corresponding unique identification code can be obtained by traversing the information list to match the dictation text of the live user, thereby confirming the target object.
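For illustration only, a minimal sketch of steps S1231-S1233 under the assumption that the collected entries arrive as typed records: deduplicate and classify the entries, keep those describing character names and character skill names, and store them in the information list with newly created unique identification codes. The raw entries shown are hypothetical.

```python
import uuid

# Hypothetical raw entries collected in step S1231.
raw_entries = [
    {"type": "character", "text": "Hero A"},
    {"type": "skill", "text": "Skill X"},
    {"type": "map", "text": "Canyon"},        # filtered out: not a wanted object attribute
    {"type": "character", "text": "Hero A"},  # duplicate, removed by deduplication
]

wanted_types = {"character", "skill"}         # object attributes required by the business
seen, info_list = set(), []
for entry in raw_entries:
    key = (entry["type"], entry["text"])
    if key in seen or entry["type"] not in wanted_types:
        continue
    seen.add(key)
    # Store the object text information associated with a unique identification code.
    info_list.append({"object_id": uuid.uuid4().hex, "object_text": [entry["text"]]})
```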
In this embodiment, the information list is compiled and created in advance, and the target object in the anchor user's spoken text can be confirmed accurately and quickly by matching against it. Compared with the prior technique of performing a series of text filtering operations on the anchor user's spoken text, such as removing colloquialisms and modal particles, to obtain the remaining target object text and then confirming the target object, this approach is simpler and more efficient.
Referring to fig. 7, in a further embodiment, the step S1300 of identifying the target object from the image stream and obtaining edge contour information of the target object in a video frame of the video stream includes the following steps:
Step S1310, extracting deep picture features of each video frame in the video stream, and constructing corresponding picture feature vectors;
In one embodiment, semantic feature extraction is performed on the picture of each video frame in the video stream through a ResNet101 residual neural network model pre-trained to convergence, so as to obtain semantic features at multiple scales. The fully connected layer in ResNet101 is removed, an FPN feature pyramid network model pre-trained to convergence is called, and deep picture features are obtained from the high-level features on the basis of the multi-scale semantic features, so that the corresponding picture feature vectors are constructed.
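For illustration only, a minimal sketch of extracting multi-scale deep picture features with a ResNet-101 backbone topped by an FPN, assuming a recent torchvision that exposes resnet_fpn_backbone; the frame tensor is a stand-in for one decoded video frame.

```python
import torch
from torchvision.models import ResNet101_Weights
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Minimal sketch: a ResNet-101 backbone with an FPN on top yields multi-scale
# deep picture features for each video frame of the video stream.
backbone = resnet_fpn_backbone(backbone_name="resnet101",
                               weights=ResNet101_Weights.DEFAULT)
backbone.eval()

frame = torch.rand(1, 3, 720, 1280)          # stand-in for one decoded video frame
with torch.no_grad():
    features = backbone(frame)               # dict of pyramid levels ('0'..'3', 'pool')
picture_feature_vector = features["pool"].flatten(1)
```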
Step S1320, calling a second neural network model according to the picture feature vector to identify the target object in the video frame of the video stream, and obtaining the real-time position of the target object in the video frame of the video stream;
the second neural network model is an object detection model, such as a Faster-RCNN model, a Yolov4 model and the like, and can be flexibly selected by a person skilled in the art.
In one embodiment, a Faster-RCNN object detection model pre-trained to convergence is called on the picture feature vector, the picture of each video frame of the video stream is divided into foreground and background (that is, object or non-object), each object in the foreground is determined through a softmax layer, the target object is further identified, and the real-time position of the target object in the picture of each video frame of the video stream is obtained.
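For illustration only, a minimal sketch of the detection step using torchvision's Faster-RCNN builder; torchvision ships a ResNet-50 FPN variant, used here as a stand-in for the model described above, and the target class index and score threshold are assumptions.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Minimal sketch: detect objects in one video frame and keep the detections of
# the assumed target class to obtain its real-time position(s).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 720, 1280)                  # one decoded video frame, CHW in [0, 1]
with torch.no_grad():
    detections = model([frame])[0]                # dict with boxes, labels, scores

target_label = 1                                  # assumed class index of the target object
keep = (detections["labels"] == target_label) & (detections["scores"] > 0.7)
real_time_positions = detections["boxes"][keep]   # (x1, y1, x2, y2) per kept detection
```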
Step S1330, a third neural network model is called to segment out a picture feature vector corresponding to the target object, and edge compensation calculation is performed on the picture feature vector to obtain edge contour information of the target object, wherein the edge contour information comprises an edge contour corresponding to the target object and a corresponding real-time position of the edge contour.
The third neural network model is an example segmentation model, such as DeepMask, mask-RCNN model, and the like, and can be flexibly selected by a person skilled in the art.
In one embodiment, a Mask-RCNN instance segmentation model pre-trained to convergence is called according to the real-time position of the target object to perform instance segmentation on the target object and construct the corresponding picture feature vector. Edge compensation calculation is performed on the picture feature vector, and a random color is set for the edge-compensated pixel blocks, so that the edge contour pixels of the target object are large enough to be clearly displayed in the picture of the video frame of the video stream. Finally, the edge contour data of the target object in the picture of each video frame of the video stream and the real-time position of the edge contour are output, thereby constructing the edge contour information.
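For illustration only, a minimal sketch of the segmentation step using torchvision's Mask-RCNN (ResNet-50 FPN variant) as a stand-in: the predicted mask of the highest-scoring detection is thresholded and its outer contour extracted with OpenCV, which together with the frame's real-time position would feed into the edge contour information.

```python
import cv2
import numpy as np
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Minimal sketch: segment the target object in one frame and extract its
# outer edge contour from the predicted mask.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 720, 1280)                  # one decoded video frame, CHW in [0, 1]
with torch.no_grad():
    pred = model([frame])[0]

edge_contour = None
if len(pred["masks"]) > 0:
    mask = (pred["masks"][0, 0] > 0.5).numpy().astype(np.uint8)   # best detection's mask
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        edge_contour = max(contours, key=cv2.contourArea)          # outer contour points
```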
In this embodiment, the target object in the image stream is identified, its corresponding edge contour is segmented out, and the edge contour is finally rendered so that it is displayed on the picture of each video frame of the image stream. Compared with the common approach of drawing a rectangular bounding box around the target object, this fits the target object much more closely, greatly improves the framing effect, and allows the target object to be clearly identified through that effect.
Referring to fig. 8, in a further embodiment, the step S1400 of pushing the edge profile information to the live broadcast room as positioning tracking information to make a terminal device receiving the positioning tracking information highlight the profile of the target object in a playing interface of the video stream includes the following steps:
Step S1410, associating the edge profile information corresponding to the target object with the timestamp of the video frame in the video stream from which it was obtained to form the positioning tracking information, uploading the positioning tracking information to a server, and pushing it to the live broadcasting room, so that the server can send the positioning tracking information to the terminal devices connected to the live broadcasting room;
When an audience member or the anchor connects to the live broadcasting room, the target detection and instance segmentation models pre-trained to convergence and the information list are downloaded, so that when the live stream pushed by the server is received, steps S1100-S1300 can be executed to obtain the edge profile information corresponding to the target object. When the edge profile information is obtained, the timestamp of the picture of the corresponding video frame in the video stream and the edge profile information are combined into the positioning tracking information, which is uploaded to the server and then pushed by the server to the terminal devices of the audience and the anchor in the live broadcasting room. In this way, whether a terminal device has just connected to the live broadcasting room or is itself executing steps S1100-S1300, it can use the positioning tracking information directly without having to generate it itself.
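For illustration only, a minimal sketch of what the positioning tracking information might look like as it is uploaded to the server: each record pairs a video-frame timestamp with the edge contour data of the target object in that frame. The field names and the endpoint URL are hypothetical.

```python
import json
import urllib.request

# Hypothetical positioning tracking information: frame timestamps associated
# with the edge contour data and real-time position of the target object.
positioning_tracking_info = {
    "room_id": "12345",
    "object_id": "hero_001",
    "frames": [
        {"timestamp_ms": 1706400000123,
         "contour": [[100, 200], [101, 205], [103, 210]],   # contour points (x, y)
         "position": [100, 200, 180, 320]},                 # bounding position of the contour
    ],
}

req = urllib.request.Request(
    "https://live.example.com/api/tracking",                # hypothetical server endpoint
    data=json.dumps(positioning_tracking_info).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)   # commented out; requires a reachable server
```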
Step S1420, detecting the on/off state of the display-tracking-object switch of the terminal device, and if the state is detected to be on, rendering an edge contour color for the target object according to the positioning tracking information, so that the contour of the target object is highlighted in the playing interface of the video stream.
The on/off state of the display-tracking-object switch in the terminal device is detected. If the detected state is off, the positioning tracking information is received and stored in a cache so that it can be quickly called when the switch is subsequently turned on; at the same time, a bottom pop-up window prompts the setting of the display-tracking-object switch, as shown at 300a in fig. 3 (a), so that the user can learn the setting details and the demonstration effect of turning on the switch through the corresponding prompt, and can set the display-tracking-object switch as needed through 301a in fig. 3 (a). If the detected state is on, the real-time position of the edge contour of the target object is located according to the corresponding video frame in the video stream, a color is randomly selected as the edge contour color for rendering, and the edge contour is highlighted on the playing interface of the video stream as shown at 301b in fig. 3 (b).
In this embodiment, on the one hand, the positioning tracking information is generated by a terminal device connected to the live broadcasting room and uploaded to the server, and the server then delivers it to every terminal device connected to the live broadcasting room, so that every terminal in the live broadcasting room can obtain the positioning tracking information and render and display the contour of the target object on the playing interface, which shortens loading time and lets users see the effect faster; on the other hand, providing the display-tracking-object switch meets the needs of different users and improves user experience through this user-friendly setting.
Referring to fig. 9, in a preferred embodiment, the step S1420 of rendering edge contour color for the target object according to the positioning tracking information so as to highlight the contour of the target object in the playing interface of the video stream includes the following steps:
step S1421, locating the edge contour in the edge contour information in the video frame of the video stream according to the real-time position in the edge contour information in the positioning and tracking information, and extracting the peripheral color of the edge contour in the video frame;
The video frame of the video stream is located according to the timestamp in the positioning tracking information; further, the peripheral colors of the edge contour in the video frame are collected according to the real-time position of the edge contour in the positioning tracking information, converted into corresponding RGB color values or hexadecimal color codes, and the proportion of each color relative to the total is counted.
Step S1422, confirming the corresponding color gamut according to the color value with the highest proportion among the peripheral colors of the edge contour, and obtaining a color value different from that color gamut to set as the edge contour color of the target object;
The color value corresponding to the color with the highest proportion is screened out according to the proportion of each color, and the color gamut corresponding to that color value is determined according to the color value ranges corresponding to different color gamuts. A different color gamut is then selected, and the color corresponding to any color value in it is taken as the edge contour color of the target object.
Step S1423, rendering the edge profile of the video frame by using the edge profile color of the target object, so as to display the edge profile of the target object in the playing interface of the video stream.
The edge contour of the target object is rendered in the edge contour color, so that the contour of the target object is displayed in the playing interface of the video stream.
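For illustration only, a minimal sketch of steps S1421-S1423 with OpenCV: sample the colors in a band around the edge contour, determine the dominant hue, choose a hue from a clearly different gamut, and render the contour in that color on the video frame. The half-turn hue shift is one simple way of picking a contrasting gamut, not necessarily the one used by the method above.

```python
import cv2
import numpy as np

def render_contrasting_contour(frame_bgr: np.ndarray, contour: np.ndarray) -> np.ndarray:
    """Render the edge contour in a color that contrasts with its surroundings."""
    # Sample a thin band around the contour to estimate the surrounding colors.
    band = np.zeros(frame_bgr.shape[:2], dtype=np.uint8)
    cv2.drawContours(band, [contour], -1, color=255, thickness=15)
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hues = hsv[band == 255, 0]                       # hue channel, range 0-179 in OpenCV

    dominant_hue = int(np.bincount(hues, minlength=180).argmax())
    contrast_hue = (dominant_hue + 90) % 180         # a hue from a clearly different gamut

    contour_color = cv2.cvtColor(
        np.uint8([[[contrast_hue, 255, 255]]]), cv2.COLOR_HSV2BGR)[0, 0]
    out = frame_bgr.copy()
    cv2.drawContours(out, [contour], -1,
                     color=tuple(int(c) for c in contour_color), thickness=3)
    return out
```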
In this embodiment, a color gamut different from the one corresponding to the color values near the edge contour of the target object in the picture of the video frame is selected, so that the contour color contrasts with the background color and the display effect of the contour is improved.
Referring to fig. 10, a live stream object tracking device provided in the present application is adapted to perform functional deployment by a live stream object tracking method in the present application, and includes: the live broadcast stream pushing module 1100, the voice translation module 1200, the image recognition module 1300 and the outline display module 1400, wherein the live broadcast stream pushing module 1100 is used for pushing a live broadcast stream to a live broadcast room, the live broadcast stream comprises a video stream and an audio stream, the audio stream comprises audio data input by audio input equipment, and the video stream comprises an image stream corresponding to a display interface of a third-party program; the speech translation module 1200 is configured to perform speech recognition on the audio data, obtain a corresponding spoken text, and determine a target object to which the spoken text points; the image recognition module 1300 is configured to recognize the target object from the image stream, and obtain edge contour information of the target object in a video frame of the video stream; the profile display module 1400 is configured to push the edge profile information to the live broadcast room as positioning tracking information, so that a terminal device that receives the positioning tracking information highlights the profile of the target object in a playing interface of the video stream.
In a further embodiment, the live stream pushing module 1100 includes: the image stream acquisition sub-module is used for acquiring the image stream corresponding to the display interface of the third-party program from the video memory; the video data receiving sub-module is used for receiving video data shot by the camera equipment connected with the anchor client equipment; the audio data receiving sub-module is used for receiving audio data input by audio input equipment connected with the anchor client equipment; and the video stream synthesis sub-module is used for synthesizing the image stream and the video data into a video stream, synthesizing the video stream and the audio data into the live stream, and pushing the live stream to a live broadcasting room for playing.
In a further embodiment, the speech translation module 1200 includes: the feature extraction sub-module, which is used for extracting deep acoustic features of the audio data and constructing corresponding acoustic feature vectors; the decoding sub-module, which is used for calling the first neural network model according to the acoustic feature vector to obtain a corresponding phoneme sequence, and decoding the phoneme sequence to obtain the dictation text; and the target object confirmation sub-module, which is used for matching the dictation text according to the object text information in the preset information list, obtaining the object text information matched with the dictation text, and thereby confirming the target object.
In an extended embodiment, the target object confirmation sub-module is preceded by: the entry acquisition unit, which is used for acquiring entries corresponding to the descriptions of the competitive items, wherein the competitive items comprise game items or sports competition items; the text information screening unit, which is used for screening out object text information corresponding to the character names and the character skill names of the target objects participating in the competitive item; and the storage unit, which is used for storing the object text information associated with the corresponding target object in the information list.
In a further embodiment, the image recognition module 1300 includes: the feature extraction submodule is used for extracting deep picture features of each video frame in the video stream and constructing corresponding picture feature vectors; the target object identification sub-module is used for calling a second neural network model according to the picture feature vector to identify the target object in the video frame of the video stream and obtaining the real-time position of the target object in the video frame of the video stream; and the edge profile information sub-module is used for calling a third neural network model to segment out a picture feature vector corresponding to the target object, performing edge compensation calculation on the picture feature vector to obtain edge profile information of the target object, wherein the edge profile information comprises an edge profile corresponding to the target object and a corresponding real-time position of the edge profile.
In a further embodiment, the profile display module 1400 includes: the positioning tracking information sub-module is used for correlating the edge profile information corresponding to the target object to obtain a time stamp corresponding to a video frame in the video stream to form positioning tracking information, uploading the positioning tracking information to a server and pushing the positioning tracking information to the live broadcasting room so that the server can send the positioning tracking information to terminal equipment connected with the live broadcasting room; and the edge contour rendering sub-module is used for detecting the starting state of a switch of a display tracking object of the terminal equipment, and rendering edge contour color for the target object according to the positioning tracking information if the state is detected to be on, so that the contour of the target object is highlighted in a playing interface of the video stream.
In a preferred embodiment, the edge contour rendering sub-module includes: the color acquisition unit is used for positioning the edge contour in the edge contour information in the video frame of the video stream according to the real-time position in the edge contour information in the positioning and tracking information and extracting the peripheral color of the edge contour in the video frame; a color confirmation unit, configured to confirm a color gamut corresponding to the edge contour according to a color value with the highest ratio in the peripheral colors of the edge contour, and acquire a color value different from the color gamut as an edge contour color of the target object; and the color rendering unit is used for rendering the edge contour of the video frame by adopting the edge contour color of the target object so as to display the edge contour of the target object in a playing interface of the video stream.
A computer device provided in accordance with one of the objects of the present application comprises a central processor and a memory, the central processor being adapted to invoke and run a computer program stored in the memory so as to perform the steps of the live stream object tracking method described herein.
A computer readable storage medium adapted for the purposes of the present application stores in the form of computer readable instructions a computer program implemented according to the live stream object tracking method, which when invoked by a computer, performs the steps comprised by the method.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. As shown in fig. 11, the internal structure of the computer device is schematically shown. The computer device includes a processor, a computer readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store a control information sequence, and when the computer readable instructions are executed by a processor, the processor can realize a live stream object tracking method. The processor of the computer device is used to provide computing and control capabilities, supporting the operation of the entire computer device. The memory of the computer device may store computer readable instructions that, when executed by the processor, may cause the processor to perform the live stream object tracking method of the present application. The network interface of the computer device is for communicating with a terminal connection. It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
The processor in this embodiment is configured to execute the specific functions of each module and its sub-modules in fig. 10, and the memory stores the program codes and various data required for executing the above modules or sub-modules. The network interface is used for data transmission with a user terminal or a server. The memory in this embodiment stores the program codes and data required for executing all modules/sub-modules of the live stream object tracking of the present application, and the server can call the program codes and data to execute the functions of all sub-modules.
The present application also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the live stream object tracking method of any of the embodiments of the present application.
Those skilled in the art will appreciate that implementing all or part of the above-described methods of embodiments of the present application may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed, may comprise the steps of embodiments of the methods described above. The storage medium may be a computer readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
In summary, according to the method and the device, through intelligent voice recognition, target detection, instance segmentation and contour labeling modes, the contours of the target objects corresponding to the spoken voices of the live broadcast users are labeled in real time in the video stream played in the live broadcast room, and the readability of live broadcast contents in a user interface is greatly improved, so that user experience is improved.
Those of skill in the art will appreciate that the various operations, methods, steps in the flow, actions, schemes, and alternatives discussed in the present application may be alternated, altered, combined, or eliminated. Further, other steps, means, or steps in a process having various operations, methods, or procedures discussed in this application may be alternated, altered, rearranged, split, combined, or eliminated. Further, steps, measures, schemes in the prior art with various operations, methods, flows disclosed in the present application may also be alternated, altered, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application, and it should be noted that, for a person skilled in the art, several improvements and modifications can be made without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. The live stream object tracking method is characterized by comprising the following steps of:
pushing a live stream to a live broadcasting room, wherein the live stream comprises a video stream and an audio stream, the audio stream comprises audio data input by audio input equipment, and the video stream comprises an image stream corresponding to a display interface of a third-party program;
performing voice recognition on the audio data to obtain a corresponding dictation text, and determining a target object pointed by the dictation text;
identifying the target object from the image stream, and acquiring edge contour information of the target object in a video frame of the video stream;
pushing the edge profile information serving as positioning tracking information to the live broadcasting room, so that terminal equipment receiving the positioning tracking information highlights the profile of the target object in a playing interface of the video stream.
2. The live stream object tracking method according to claim 1, wherein pushing the live stream to the live room comprises the steps of:
acquiring an image stream corresponding to a display interface of a third-party program from a video memory;
receiving video data shot by camera equipment connected with a host client device;
receiving audio data input by an audio input device connected with a host client device;
And synthesizing the image stream and the video data into a video stream, synthesizing the video stream and the audio data into the live stream, and pushing the live stream to a live broadcasting room for playing.
3. The method for tracking a live stream object according to claim 1, wherein the step of performing voice recognition on the audio data to obtain a corresponding spoken text and determining a target object to which the spoken text is directed comprises the steps of:
extracting deep acoustic features of the audio data, and constructing corresponding acoustic feature vectors;
calling a first neural network model according to the acoustic feature vector to obtain a corresponding phoneme sequence, and decoding the phoneme sequence to obtain the dictation text;
and matching the dictation text according to object text information in a preset information list, and obtaining object text information matched with the dictation text so as to confirm the target object.
4. A live stream object tracking method according to claim 3, wherein matching the spoken text according to object text information in a preset information list, obtaining object text information matching the spoken text, and before confirming the target object with it, comprises the steps of:
Acquiring entries corresponding to descriptions of competitive items, wherein the competitive items comprise game items or sports competition items;
screening out object text information corresponding to the character names of target objects participating in the athletic project and the character skill names;
and storing the object text information associated with the corresponding target object in the information list.
5. The live stream object tracking method according to claim 1, wherein the identifying the target object from the image stream, and acquiring edge profile information of the target object in a video frame of the video stream, comprises the steps of:
extracting deep picture features of each video frame in the video stream, and constructing corresponding picture feature vectors;
invoking a second neural network model according to the picture feature vector to identify the target object in the video frame of the video stream, and obtaining the real-time position of the target object in the video frame of the video stream;
and calling a third neural network model to segment out a picture feature vector corresponding to the target object, and performing edge compensation calculation on the picture feature vector to obtain edge contour information of the target object, wherein the edge contour information comprises an edge contour corresponding to the target object and a corresponding real-time position of the edge contour.
6. The live stream object tracking method according to claim 1, wherein the edge profile information is pushed to the live broadcasting room as positioning tracking information, so that a terminal device receiving the positioning tracking information highlights the profile of the target object in a playing interface of the video stream, and the method comprises the following steps:
the edge profile information corresponding to the target object is correlated with the time stamp of the video frame for obtaining the edge profile information to form positioning tracking information, and the positioning tracking information is uploaded to a server to be pushed to the live broadcasting room so that the server can send the positioning tracking information to terminal equipment connected with the live broadcasting room;
and the detection terminal equipment displays the starting state of the tracking object switch, and if the state is detected to be on, the edge contour color is rendered for the target object according to the positioning tracking information, so that the contour of the target object is highlighted in the playing interface of the video stream.
7. The live stream object tracking method according to claim 6, wherein rendering edge contour colors for the target object according to the positioning tracking information such that the contour of the target object is highlighted in the playback interface of the video stream, comprising the steps of:
Positioning the edge contour in the edge contour information in the video frame of the video stream according to the real-time position in the edge contour information in the positioning and tracking information, and extracting the peripheral color of the edge contour in the video frame;
confirming a corresponding color gamut according to the color value with the highest duty ratio in the peripheral colors of the edge contour, and acquiring the color value which is different from the color gamut and is set as the edge contour color of the target object;
and rendering the edge contour of the video frame by adopting the edge contour color of the target object so as to display the edge contour of the target object in a playing interface of the video stream.
8. A live stream object tracking apparatus, comprising:
the live broadcast stream pushing module is used for pushing a live broadcast stream to a live broadcast room, wherein the live broadcast stream comprises a video stream and an audio stream, the audio stream comprises audio data input by audio input equipment, and the video stream comprises an image stream corresponding to a display interface of a third-party program;
the voice translation module is used for carrying out voice recognition on the audio data to obtain a corresponding dictation text, and determining a target object pointed by the dictation text;
The image identification module is used for identifying the target object from the image stream and acquiring edge contour information of the target object in a video frame of the video stream;
and the contour display module is used for pushing the edge contour information serving as positioning tracking information to the live broadcasting room, so that the terminal equipment receiving the positioning tracking information highlights the contour of the target object in the playing interface of the video stream.
9. A computer device comprising a central processor and a memory, characterized in that the central processor is arranged to invoke a computer program stored in the memory for performing the steps of the method according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores in the form of computer-readable instructions a computer program implemented according to the method of any one of claims 1 to 7, which, when invoked by a computer, performs the steps comprised by the corresponding method.
CN202210106703.7A 2022-01-28 2022-01-28 Live stream object tracking method, device, equipment and medium thereof Active CN114401417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210106703.7A CN114401417B (en) 2022-01-28 2022-01-28 Live stream object tracking method, device, equipment and medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210106703.7A CN114401417B (en) 2022-01-28 2022-01-28 Live stream object tracking method, device, equipment and medium thereof

Publications (2)

Publication Number Publication Date
CN114401417A CN114401417A (en) 2022-04-26
CN114401417B true CN114401417B (en) 2024-02-06

Family

ID=81232229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210106703.7A Active CN114401417B (en) 2022-01-28 2022-01-28 Live stream object tracking method, device, equipment and medium thereof

Country Status (1)

Country Link
CN (1) CN114401417B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333868A (en) * 2022-06-24 2024-01-02 华为云计算技术有限公司 Method, device and storage medium for identifying object
CN114845133B (en) * 2022-06-30 2022-09-16 南京联迪信息系统股份有限公司 Man-machine interaction method and system for online live broadcast
CN115190339B (en) * 2022-09-13 2024-04-30 北京达佳互联信息技术有限公司 Live broadcast information sending method and device, electronic equipment and storage medium
CN115767113B (en) * 2022-09-22 2023-09-01 北京国际云转播科技有限公司 Cloud rebroadcasting method, device, medium and system
CN116993376B (en) * 2023-08-14 2024-03-12 深圳数拓科技有限公司 Intelligent data interaction method and system for unmanned restaurant

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105791958A (en) * 2016-04-22 2016-07-20 北京小米移动软件有限公司 Method and device for live broadcasting game
CN107423274A (en) * 2017-06-07 2017-12-01 北京百度网讯科技有限公司 Commentary content generating method, device and storage medium based on artificial intelligence
CN108833969A (en) * 2018-06-28 2018-11-16 腾讯科技(深圳)有限公司 A kind of clipping method of live stream, device and equipment
CN109543102A (en) * 2018-11-12 2019-03-29 百度在线网络技术(北京)有限公司 Information recommendation method, device and storage medium based on video playing
US10248866B1 (en) * 2018-01-17 2019-04-02 Gopro, Inc. Systems and methods for identifying video highlights based on audio
CN110958416A (en) * 2019-12-06 2020-04-03 佳讯飞鸿(北京)智能科技研究院有限公司 Target tracking system and remote tracking system
CN113301372A (en) * 2021-05-20 2021-08-24 广州繁星互娱信息科技有限公司 Live broadcast method, device, terminal and storage medium
CN113350783A (en) * 2021-05-21 2021-09-07 广州博冠信息科技有限公司 Game live broadcast method and device, computer equipment and storage medium
CN113453022A (en) * 2021-06-30 2021-09-28 康佳集团股份有限公司 Image display method and device, television and storage medium
CN113849687A (en) * 2020-11-23 2021-12-28 阿里巴巴集团控股有限公司 Video processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0004499D0 (en) * 2000-02-26 2000-04-19 Orad Hi Tec Systems Ltd Television illustration system

Also Published As

Publication number Publication date
CN114401417A (en) 2022-04-26

Similar Documents

Publication Publication Date Title
CN114401417B (en) Live stream object tracking method, device, equipment and medium thereof
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
CN112131988B (en) Method, apparatus, device and computer storage medium for determining virtual character lip shape
CN110379430B (en) Animation display method and device based on voice, computer equipment and storage medium
JP7312853B2 (en) AI-BASED VOICE-DRIVEN ANIMATION METHOD AND APPARATUS, DEVICE AND COMPUTER PROGRAM
CN112088402A (en) Joint neural network for speaker recognition
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN110517689B (en) Voice data processing method, device and storage medium
WO2019037615A1 (en) Video processing method and device, and device for video processing
CN110808034A (en) Voice conversion method, device, storage medium and electronic equipment
KR20210007786A (en) Vision-assisted speech processing
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN109859298B (en) Image processing method and device, equipment and storage medium thereof
CN110322760A (en) Voice data generation method, device, terminal and storage medium
CN109429078A (en) Method for processing video frequency and device, for the device of video processing
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN112738557A (en) Video processing method and device
CN109429077A (en) Method for processing video frequency and device, for the device of video processing
CN114125506B (en) Voice auditing method and device
CN116229311B (en) Video processing method, device and storage medium
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN116756285A (en) Virtual robot interaction method, device and storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
CN113891150A (en) Video processing method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant