CN114612819A - Football event detection method, equipment and storage medium - Google Patents

Football event detection method, equipment and storage medium

Info

Publication number
CN114612819A
Authority
CN
China
Prior art keywords
event
football
data
video data
reference event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210156457.6A
Other languages
Chinese (zh)
Inventor
王景文
周卫
司季雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202210156457.6A priority Critical patent/CN114612819A/en
Publication of CN114612819A publication Critical patent/CN114612819A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a football event detection method, device, and storage medium. The method comprises: acquiring program data whose content is a football match, wherein the program data includes original video data and original audio data; identifying an event occurring in the football match according to features of the original video data and features of the original audio data, as a first reference event; identifying an event occurring in the football match according to elements displayed in the original video data, as a second reference event; identifying an event occurring in the football match according to the language expressed in the original audio data, as a third reference event; and marking, on the program data, an event occurring in the football match as a target event according to at least one of the first reference event, the second reference event, and the third reference event. The method can fully exploit the advantages of the different techniques while suppressing their weaknesses, improving the accuracy of detecting events occurring in a football match.

Description

Football event detection method, equipment and storage medium
Technical Field
The present invention relates to the field of live streaming technologies, and in particular, to a football event detection method, device, and storage medium.
Background
Football matches are among the most widely watched sports events, for example matches between football clubs or between national teams.
Because a football match is long, generally 2 hours or more, viewers often do not watch the whole match but instead pay attention to specific events in it, such as corner kicks, yellow cards, penalty kicks, and substitutions, which can significantly influence the outcome of the match.
To help users quickly locate these events, current approaches mostly extract features from the video of the football match and recognize semantics from those features, so as to identify the events occurring in the match.
However, there is a significant gap between low-level video features and high-level video semantics, so the accuracy of recognizing semantics in the football match is low, and consequently the accuracy of identifying events occurring in the match is also low.
Disclosure of Invention
The invention provides a football event detection method, device, and storage medium, to solve the problem of low accuracy in identifying events occurring in a football match.
According to an aspect of the present invention, there is provided a football event detection method, including:
acquiring program data whose content is a football match, wherein the program data includes original video data and original audio data;
identifying an event occurring in the football match according to features of the original video data and features of the original audio data, as a first reference event;
identifying an event occurring in the football match according to elements displayed in the original video data, as a second reference event;
identifying an event occurring in the football match according to the language expressed in the original audio data, as a third reference event;
marking, on the program data, an event occurring in the football match as a target event according to at least one of the first reference event, the second reference event, and the third reference event.
According to another aspect of the present invention, there is provided a football event detection apparatus, including:
a program data acquisition module, configured to acquire program data whose content is a football match, wherein the program data includes original video data and original audio data;
a first reference event identification module, configured to identify an event occurring in the football match according to features of the original video data and features of the original audio data, as a first reference event;
a second reference event identification module, configured to identify an event occurring in the football match according to elements displayed in the original video data, as a second reference event;
a third reference event identification module, configured to identify an event occurring in the football match according to the language expressed in the original audio data, as a third reference event;
a target event marking module, configured to mark, on the program data, an event occurring in the football match as a target event according to at least one of the first reference event, the second reference event, and the third reference event.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the football event detection method according to any embodiment of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the football event detection method according to any embodiment of the present invention.
In this embodiment, program data whose content is a football match is acquired, the program data including original video data and original audio data; an event occurring in the football match is identified according to features of the original video data and features of the original audio data, as a first reference event; an event occurring in the football match is identified according to elements displayed in the original video data, as a second reference event; an event occurring in the football match is identified according to the language expressed in the original audio data, as a third reference event; and an event occurring in the football match is marked on the program data as a target event according to at least one of the first, second, and third reference events. This embodiment uses the original video data and original audio data, alone or in combination, to predict events that may occur in the football match, making full use of existing material and controlling its cost. Because the different techniques differ in quality, the event finally determined to have occurred is confirmed from the predictions obtained by the different techniques used separately or fused together, which fully exploits the advantages of each technique, suppresses its weaknesses, and improves the accuracy of detecting events occurring in the football match.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a football event detection method according to Embodiment 1 of the present invention;
Fig. 2 is an architecture diagram for detecting events occurring in a football match according to Embodiment 1 of the present invention;
Fig. 3 is an exemplary diagram of template matching according to Embodiment 1 of the present invention;
Fig. 4 is a flowchart of a football event detection method according to Embodiment 2 of the present invention;
Fig. 5 is a flowchart of a football event detection method according to Embodiment 3 of the present invention;
Fig. 6 is a schematic structural diagram of a football event detection apparatus according to Embodiment 4 of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device implementing the football event detection method according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Embodiment 1
Fig. 1 is a flowchart of a football event detection method according to Embodiment 1 of the present invention. This embodiment is applicable to tagging events in a football match based on multi-modal fusion. The method may be executed by a football event detection apparatus, which may be implemented in the form of hardware and/or software and may be configured in an electronic device. As shown in fig. 1, the method includes:
step 101, obtaining program data with the content of the football game.
In this embodiment, the content of the program data may be collected through different channels, such as uploading by a user and purchasing from a service provider, for a soccer event, where the program data may be live data or recorded data offline, and this embodiment is not limited thereto.
The program data of the online live broadcast can be released in a live broadcast room, and can also be directly released in a web page, which is not limited in this embodiment.
For professional football events, such as the match between football clubs, the match between national football teams and the like, program data of the professional football events are generally provided by professional service providers, the definition of pictures and the definition of tone quality are higher, the content of the pictures and the content of commentary are more standard and rich, and semantic analysis is favorably carried out on the program data.
Of course, for non-professional football events, such as games between enterprise football teams, games between school football teams, etc., the program data may also be provided by individuals, which is not limited in this embodiment.
In a specific implementation, as shown in fig. 2, the program data 210 includes video data and audio data, and for convenience of distinction, the video data is denoted as original video data 211, the audio data is denoted as original audio data 212, and the original video data 211 and the original audio data 212 are aligned on a time axis.
Step 102, identifying an event occurring in the football match according to features of the original video data and features of the original audio data, as a first reference event.
During a football match there are usually many events of particular significance that may affect the progress of the match, such as events arising from the referee's decisions, e.g., penalty kicks, corner kicks, fouls, direct free kicks, indirect free kicks, throw-ins, kick-offs, balls out of play, offside, yellow cards, and red cards; events performed by the players, e.g., goals, shots, dribbles, missed shots, saves, clearances, and own goals; and events actively requested of the referee by the coach, e.g., substitutions.
As shown in fig. 2, for the original video data 211, features may be extracted at the data level of pixels; for the original audio data 212, features may be extracted at the data level of audio frames; an event that may occur in the football match is then predicted according to the features of the original video data and the features of the original audio data, and recorded as a first reference event.
In one embodiment of the present invention, step 102 may include the steps of:
step 1021, at least two characteristics are extracted from the original video data as video characteristics.
In this embodiment, at least two networks or algorithms can be used to extract at least two features of different dimensions from the original video data; these are recorded as video features.
In an example of this embodiment, the video features include a first action feature, a second action feature, and a time-sequence feature. In this example, step 1021 may further include the following steps:
step 10211, determining a time period network TSN, a time shift model TSM and a video learning network VTN.
In this example, as shown in fig. 2, three networks are preset, and are distributed as a Temporal Segment Network (TSN), a Temporal Shift Model (TSM), and a Video Transform Network (VTN).
In training the three networks, a first data set, a second data set, a third data set, and a fourth data set may be acquired, respectively.
The first data set has a plurality of video data recorded as first sample video data, the second data set has a plurality of video data recorded as second sample video data, the third data set has a plurality of video data recorded as third sample video data, and the fourth data set has a plurality of video data recorded as fourth sample video data.
The first, second, and third data sets are public data sets, for example one or more of Moments-in-Time, Kinetics-400, and ImageNet-21k. The content of the first sample video data, the second sample video data, and the third sample video data is diverse and may or may not be related to football, and all of it is annotated with actions.
The fourth data set is a football-specific data set, such as SoccerNet-v2; the content of its fourth sample video data is football matches, annotated with the events occurring in them.
In the fourth sample video data, a certain duration (e.g., 1 second) is randomly extended forward and/or backward from the timestamp of an event occurring in the football match, so that clips of fixed total duration (e.g., 5 seconds) containing the event are extracted from the fourth sample video data; these are recorded as video clips and serve as positive samples.
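The clip extraction above can be sketched as follows; the helper name clip_window and the exact jitter scheme are illustrative assumptions, not taken from the patent:

```python
import random

def clip_window(event_ts, total_len=5.0, jitter=1.0, rng=None):
    """Place a fixed-length window around an event timestamp.

    The window start is jittered by up to `jitter` seconds before the
    event, so the event does not always sit at the same offset within
    the extracted clip.
    """
    rng = rng or random.Random()
    offset = rng.uniform(0.0, jitter)      # random lead-in before the event
    start = max(0.0, event_ts - offset)
    return start, start + total_len

# An event at t = 120 s yields a 5-second clip containing that timestamp.
start, end = clip_window(120.0, rng=random.Random(0))
assert abs((end - start) - 5.0) < 1e-9
assert start <= 120.0 <= end
```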
The temporal segment network TSN provides a network architecture. In this example, the TSN may be constructed using the backbone of a first residual network (e.g., ResNet-101), first pre-trained on the first data set (e.g., Moments-in-Time), and, once the first pre-training is completed, fine-tuned using the video clips.
In this example, the temporal shift model TSM may be constructed using the backbone of a second residual network (e.g., ResNet-50), second pre-trained on the second data set (e.g., Kinetics-400), and, once the second pre-training is completed, fine-tuned using the video clips.
A third pre-training is performed on the video transformer network VTN using the third data set (e.g., the union of ImageNet-21k and Kinetics-400), and once the third pre-training is completed, the VTN is fine-tuned using the video clips.
In this example, different data sets are used to train the networks, so that the information in each data set is exploited to the greatest extent, learning stronger and more robust features.
Step 10212, inputting the original video data into the temporal segment network TSN to extract a feature representing motion, as the first action feature.
As shown in fig. 2, the original video data 211 is input into the temporal segment network TSN, which processes it according to its structure and extracts a feature representing motion from it, recorded as the first action feature.
Step 10213, inputting the original video data into the temporal shift model TSM to extract a feature representing motion, as the second action feature.
As shown in fig. 2, the original video data 211 is input into the temporal shift model TSM, which processes it according to its structure and extracts a feature representing motion from it, recorded as the second action feature.
Step 10214, inputting the original video data into the video transformer network VTN to extract a time-sequence feature, as the time-sequence feature.
As shown in fig. 2, the original video data 211 is input into the video transformer network VTN, which processes it according to its structure and extracts a time-sequence feature from it, recorded as the time-sequence feature.
Step 1022, extracting features from the raw audio data as audio features.
In this embodiment, a preset network or algorithm may be used to extract features from the original audio data, and the extracted features are denoted as audio features.
In an example of this embodiment, as shown in fig. 2, a convolutional neural network VGGish may be determined, the original audio data 212 is input into the convolutional neural network VGGish, and the convolutional neural network VGGish processes the original audio data according to its structure, and extracts features from the original audio data as audio features.
Step 1023, combining the at least two video features and the audio feature into a live-broadcast feature.
In this embodiment, as shown in fig. 2, the at least two video features and the audio feature may be merged and fused 220 to obtain a multi-modal live-broadcast feature.
In an example of this embodiment, as shown in fig. 2, if the video features include the first action feature, the second action feature, and the time-sequence feature, these may be aligned in time with the audio feature; that is, the first action feature, second action feature, time-sequence feature, and audio feature extracted from the original video data and original audio data at the same moment are found, guaranteeing that their content corresponds.
The aligned first action feature, second action feature, time-sequence feature, and audio feature are spliced into a live-broadcast feature, so that the live-broadcast feature is multi-modal and information-rich, which can improve the accuracy of predicting events occurring in the football match.
The dimensionality of the first action feature output by the temporal segment network TSN is 2048, that of the second action feature output by the temporal shift model TSM is 2048, that of the time-sequence feature output by the video transformer network VTN is 768, and that of the audio feature output by the convolutional neural network VGGish is 128; the live-broadcast feature formed by splicing them therefore has dimensionality 4992.
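With the stated dimensionalities, the splicing can be sketched as below; the placeholder arrays stand in for the real network outputs and their names are assumptions:

```python
import numpy as np

T = 10  # number of time-aligned steps (illustrative)
tsn = np.zeros((T, 2048))  # first action feature (TSN output)
tsm = np.zeros((T, 2048))  # second action feature (TSM output)
vtn = np.zeros((T, 768))   # time-sequence feature (VTN output)
vgg = np.zeros((T, 128))   # audio feature (VGGish output)

# Concatenate the time-aligned features along the channel axis
# to form the multi-modal live-broadcast feature.
live = np.concatenate([tsn, tsm, vtn, vgg], axis=1)
assert live.shape == (T, 2048 + 2048 + 768 + 128)  # (T, 4992)
```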
Step 1024, calculating, based on the live-broadcast feature, the probability of each event occurring in the football match.
A specific moment in the football match can be classified using the live-broadcast feature; the classification targets are the events annotated in the data set in advance, so the probability of each event occurring in the football match is predicted.
Some open-source data sets have pre-defined events; for example, SoccerNet-v2 defines 17 events. These do not necessarily satisfy the service requirements, in which case other events may be added on top of the pre-defined ones.
In one example, as shown in fig. 2, a plurality of (e.g., 3) Transformer encoder layers may be determined, and the live-broadcast feature is sequentially input into the plurality of encoder layers for encoding, to output an encoded feature.
In this example, the input to the first encoder layer is the live-broadcast feature, the input to each subsequent encoder layer is the output of the previous encoder layer, and the output of the last encoder layer is recorded as the encoded feature.
A function such as Softmax is used to activate the encoded feature, obtaining the probability of each event occurring in the football match.
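A minimal sketch of the Softmax activation that turns per-event scores from the encoded feature into probabilities; the logit values are purely illustrative:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical per-event scores, e.g. [goal, corner kick, yellow card].
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
assert abs(probs.sum() - 1.0) < 1e-9  # a valid probability distribution
assert probs.argmax() == 0            # the highest score keeps the highest probability
```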
Step 1025, determining, according to the probabilities, the event occurring in the football match, as the first reference event.
As shown in fig. 2, by analyzing and processing the probabilities of the events, the event occurring in the football match can be determined and recorded as the first reference event.
In a specific implementation, the probability of each event may be thresholded; that is, probabilities greater than or equal to a preset probability threshold are extracted as candidate probabilities.
Non-maximum suppression is performed on the candidate probabilities to obtain target probabilities, and the events of the football match corresponding to the target probabilities are determined as first reference events.
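The thresholding followed by non-maximum suppression over per-frame event probabilities can be sketched as follows; the function name, the threshold value, and the suppression window are illustrative assumptions:

```python
import numpy as np

def detect_events(probs, times, thresh=0.5, window=10.0):
    """Thresholding + 1-D temporal non-maximum suppression.

    probs: per-frame probability of one event class; times: timestamps (s).
    Keeps peaks at or above `thresh`; suppresses weaker detections
    within `window` seconds of an already-kept peak.
    """
    order = np.argsort(probs)[::-1]  # strongest detections first
    kept = []
    for i in order:
        if probs[i] < thresh:
            break  # remaining candidates are below the threshold
        if all(abs(times[i] - times[j]) > window for j in kept):
            kept.append(i)
    return sorted(kept)

probs = np.array([0.2, 0.9, 0.8, 0.1, 0.7])
times = np.array([0.0, 5.0, 8.0, 20.0, 40.0])
keep = detect_events(probs, times)
assert keep == [1, 4]  # the 0.8 peak at t=8 is suppressed by the 0.9 peak at t=5
```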
Step 103, identifying an event occurring in the football match according to elements displayed in the original video data, as a second reference event.
For some events occurring in a football match, interface elements are added to the picture during post-production to aid viewers. Therefore, as shown in fig. 2, templates may be prepared for these elements in advance, and the elements displayed in the original video data compared with the templates, so as to predict events that may have occurred in the match, recorded as second reference events.
In one embodiment of the present invention, step 103 may comprise the steps of:
step 1031, extracting pixel points located in the specified first range from the original video data to obtain a first image area.
For some events that occur in a soccer game, such as red cards, Yellow cards, etc., the associated text may appear in a particular area, such as "Dismissal" for red cards, "Yellow cards" for Yellow cards, "etc.
In this case, the range where the text appears may be marked in advance and recorded as the first range, so that the pixel points located in the specified first range are extracted from the original video data to obtain the first image region.
In one example, as shown in FIG. 3, for the Yellow Card event, the range 310 in which the keyword "Yellow Card" that prompts the Yellow Card appears may be set to a first range.
Further, these texts are usually added later by the service provider providing the program data, and for a given service provider, the first range of the added texts is usually fixed, so that the first range can be marked in advance for different service providers, and a first mapping relationship between the service provider and the first range can be established.
At this time, a service provider providing program data may be queried, and a first range set to the service provider is queried in the first mapping relationship, where the first range is used to display text related to an event occurring in a football match, and as shown in fig. 2, a pixel point located in the first range is extracted from the original video data 211, so as to obtain a first image area.
Step 1032, performing optical character recognition on the first image region to obtain first text information.
As shown in fig. 2, a first image region is extracted from the original video data 211, and Optical Character Recognition (OCR) is performed on the first image region to obtain first text information.
Step 1033, matching the first text information with preset first reference words.
In this embodiment, keywords expressing events of the football match may be selected in advance from the text added by the service provider and recorded as first reference words (templates); that is, a first reference word is a keyword expressing an event of the football match, such as "Dismissal" or "Yellow Card".
For the first text information in the current original video data, the first text information may be matched against the first reference words.
Generally, if the first text information is identical to a certain first reference word, the two can be considered successfully matched.
Considering that optical character recognition has a certain error rate, a small number of misrecognized or missing characters may exist. To mitigate their negative effect, this embodiment may match the first text information with the first reference words by fuzzy matching.
Specifically, the edit distance (Levenshtein distance) between the first text information and a preset first reference word is calculated.
If the edit distance is less than or equal to a preset distance threshold, the similarity between the first text information and the first reference word is high, and the two may be determined to match successfully.
For example, if optical character recognition of "Dismissal" in the first image region mistakes "i" for "1", the first text information is output as "Dism1ssal"; with "Dismissal" as the first reference word, the edit distance between "Dism1ssal" and "Dismissal" is 1, and the two match successfully.
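The fuzzy matching above can be sketched with the classic dynamic-programming edit distance; the helper names levenshtein and fuzzy_match are illustrative assumptions:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(text, keywords, max_dist=1):
    """Return the reference words within `max_dist` edits of the OCR text."""
    return [k for k in keywords if levenshtein(text, k) <= max_dist]

# The OCR error from the example: "i" misread as "1".
assert levenshtein("Dism1ssal", "Dismissal") == 1
assert fuzzy_match("Dism1ssal", ["Dismissal", "Yellow Card"]) == ["Dismissal"]
```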
Step 1034, if the matching is successful, determining that the event corresponding to the first reference word has occurred in the football match, as a second reference event.
As shown in fig. 2, if the first text information is successfully matched with a certain first reference word, the event corresponding to that first reference word may be considered to have occurred in the football match and is recorded as a second reference event.
In another embodiment of the present invention, step 103 may comprise the steps of:
in step 1035, pixels in the designated second range are extracted from the original video data to obtain a second image region.
As shown in fig. 2, for some events occurring in a soccer event, such as a red card, a yellow card, a trade, etc., an associated icon (LOGO) appears in a specific range, such as a red card, a yellow card, a trade up and down arrow, etc.
In this case, the range in which the icon appears may be marked in advance and marked as the second range, so that the pixel points located in the specified second range are extracted from the original video data to obtain the second image region.
In one example, as shown in FIG. 3, for the yellow card event, a range 320 that suggests the presence of a yellow card of the yellow card may be set to a second range.
Further, these icons are usually added later by the service provider providing the program data, and the second range of the added icons is usually fixed for a given service provider, so that the second range can be marked in advance for different service providers, and a mapping relationship between the service provider and the second range can be established.
At this time, a service provider providing program data may be queried, and a second range set for the service provider is queried in the second mapping relationship, where the second range is used to display an icon used by an event occurring with a soccer event, as shown in fig. 2, a pixel point located in the second range is extracted from the original video data 211, and a second image region is obtained.
Step 1036, matching the second image area with a preset reference image block.
In this embodiment, the icon added by the service provider may be previously recorded as a reference image block (template), that is, the reference image block is an icon used by an event occurring in the soccer event.
As shown in fig. 2, for the second image area in the current original video data 211, the second image area may be matched with the reference image blocks.
In one matching manner, a cross-correlation coefficient (cross correlation) between the second image region and a preset reference image block may be calculated; if the cross-correlation coefficient is greater than or equal to a preset correlation threshold, the similarity between the second image region and the reference image block is high, and the second image region may be determined to have been successfully matched with the reference image block.
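The cross-correlation matching in step 1036 can be sketched as a normalized cross-correlation between two equally sized patches (the 0.8 threshold is an illustrative value, not one specified by the embodiment):

```python
import numpy as np

def cross_correlation(region: np.ndarray, template: np.ndarray) -> float:
    """Normalized cross-correlation between two equally sized patches."""
    a = region.astype(np.float64).ravel()
    b = template.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def matches_reference(region: np.ndarray, template: np.ndarray,
                      threshold: float = 0.8) -> bool:
    """The match succeeds when the coefficient reaches the threshold."""
    return cross_correlation(region, template) >= threshold
```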
And 1037, if the matching is successful, determining that an event corresponding to the reference image block occurs in the football game, and using the event as a second reference event.
If the second image region is successfully matched with a certain reference image block, it may be considered that the event corresponding to that reference image block has occurred in the football match, and the event is recorded as a second reference event.
And 104, identifying an event occurring in the football event according to the language expressed by the original audio data, and using the event as a third reference event.
In some football matches, especially professional ones, commentators are typically provided to narrate the match and analyze its tactics, for example explaining the characteristics of each player, the status of the club to which a team belongs, and the teams' attacking and defensive strategies, so as to help the audience follow the progress of the match. The commentator's speech is recorded in the original audio data.
The language expressed by the original audio data is therefore usually the commentator's speech, which may describe events occurring in the football match; accordingly, the language expressed by the original audio data may be used to predict events occurring in the football match, each such event being denoted as a third reference event.
In one embodiment of the present invention, step 104 may include the steps of:
step 1041, performing speech recognition on the original audio data to obtain second text information representing a sentence.
In the present embodiment, as shown in fig. 2, automatic speech recognition (ASR) may be performed on the original audio data 212 to convert the language expressed by the original audio data into second text information at the sentence level.
Step 1042, comparing the second text information with a preset second reference word.
In this embodiment, keywords expressing events occurring in a football match may be screened in advance from the words commonly used by commentators and recorded as second reference words; that is, a second reference word is a keyword expressing an event occurring in the football match, for example a commentary keyword such as "defused", "successfully cleared", or "defense succeeded".
For the second text information in the current original audio data, the second text information may be compared with these second reference words.
Step 1043, if the second text information includes the second reference word, determining that an event corresponding to the second reference word occurs in the football game, and using the event as a third reference event.
As shown in fig. 2, if the second text information includes a certain second reference word, it may be considered that an event corresponding to the second reference word occurs in the soccer event, and the event is recorded as a third reference event.
In another embodiment of the present invention, step 104 may further include the steps of:
and step 1044, comparing the second text information with a preset third reference word.
In practical applications, the commentator's speech contains fuzzy semantics such as questions and exclamations. For example, when a player shoots at goal, the goalkeeper makes a save, and the ball does not enter the goal, the commentator may say something like "what a wonderful save, the ball just missed the goal".
In this regard, in this embodiment, keywords expressing fuzzy semantics may be screened in advance from the words commonly used by commentators and recorded as third reference words; that is, a third reference word is a keyword expressing fuzzy semantics, for example "is it possible", "it seems", "almost", or "not sure".
For the second text information in the current original audio data, in addition to comparing the second text information with the second reference word, the second text information may be compared with a third reference word.
Step 1045, if the second text information includes the third reference word, determining that the third reference event does not occur in the football match.
If the second text information contains a third reference word, it may be considered that a third reference event has not occurred in the soccer event.
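Steps 1042-1045 amount to keyword containment checks with a fuzzy-semantics veto. A minimal sketch, with illustrative keyword lists standing in for the curated commentary vocabularies:

```python
# Illustrative keyword lists; the real vocabularies would be screened in
# advance from actual commentary and are assumptions here.
SECOND_REFERENCE_WORDS = ["goal", "penalty", "defense success"]
THIRD_REFERENCE_WORDS = ["maybe", "it seems", "almost", "not sure"]

def third_reference_event(sentence: str):
    """Return the matched event keyword, unless a fuzzy-semantics keyword
    vetoes the sentence (steps 1042-1045)."""
    text = sentence.lower()
    # Steps 1044-1045: a third reference word means no event is confirmed.
    if any(word in text for word in THIRD_REFERENCE_WORDS):
        return None
    # Steps 1042-1043: a second reference word confirms its event.
    for word in SECOND_REFERENCE_WORDS:
        if word in text:
            return word
    return None
```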
In practical applications, directly recognizing the semantics of a whole sentence with natural language processing is difficult and has low accuracy; in this embodiment, semantic filtering is instead performed through keywords, which ensures high accuracy.
Step 105, marking the event occurring in the football event as the target event according to the program data in at least one of the first reference event, the second reference event and the third reference event.
As shown in fig. 2, the first reference event, the second reference event and the third reference event are events predicted by different technologies, and different technologies have different advantages and disadvantages, so the first, second and third reference events carry different confidences when used. In this embodiment, therefore, at least one of the first reference event, the second reference event and the third reference event is selected according to the situation to finally confirm the event occurring in the football match, and that event is marked on the program data at the corresponding time and recorded as a target event.
Further, in some cases, the first, second, or third reference event alone may be considered a target event occurring in a soccer event.
In some cases, two or three of the first reference event, the second reference event, and the third reference event may be corrected with respect to each other to identify a target event occurring in the soccer event.
In one embodiment of the present invention, step 105 may include the steps of:
and 1051, classifying the first reference event or the third reference event.
In this embodiment, the confidences of the first reference event and the third reference event are relatively high, so each may confirm the target event occurring in the football match either independently or in combination with other events. Accordingly, with reference to their confidences under different conditions, different categories may be assigned to the first reference event and the third reference event, and the categories may be recorded in the form of configuration files, databases, and the like.
Step 1052, if the category of the first reference event is the first target category, marking the first reference event as an event occurring in the soccer event as a target event for the program data.
If the category of the first reference event is a first target category, for example a penalty kick, goal, shot, corner kick, foul, direct free kick, indirect free kick, throw-in, kickoff, or out of bounds, the confidence of the first reference event is high, so the first reference event alone may be confirmed as an event occurring in the football match, recorded as a target event, and the target event is marked on the program data at the time when the first reference event occurred.
Step 1053, if the type of the third reference event is the second target type, marking the third reference event as an event occurring in the football game as the target event for the program data.
If the category of the third reference event is a second target category, for example a back pass, a mistake, a save, a clearance, or an own goal, the confidence of the third reference event is high, so the third reference event alone may be confirmed as an event occurring in the football match, recorded as a target event, and the target event is marked on the program data at the time when the third reference event occurred.
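Steps 1051-1053 reduce to a category lookup that decides whether a reference event may be confirmed on its own. A sketch with hypothetical category tables (a real deployment would record them in a configuration file or database, as the embodiment notes):

```python
# Hypothetical category tables standing in for the configured ones.
FIRST_TARGET_CATEGORY = {"penalty kick", "goal", "shot", "corner kick", "foul"}
SECOND_TARGET_CATEGORY = {"mistake", "save", "own goal"}

def standalone_target_event(event: str, source: str):
    """Return the event if its category allows confirming it on its own:
    first reference events in the first target category (step 1052) and
    third reference events in the second target category (step 1053)."""
    if source == "first" and event in FIRST_TARGET_CATEGORY:
        return event
    if source == "third" and event in SECOND_TARGET_CATEGORY:
        return event
    return None
```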
Step 1054, if the category of the first reference event is the third target category, then search for a third reference event matching the first reference event in time.
If the category of the first reference event is a third target category, for example offside, the confidence of the first reference event is moderate, and the third reference event also has a certain confidence for the third target category. Therefore, the first reference event and the third reference event may be combined to confirm the event occurring in the football match; at this time, the first reference event and the third reference events predicted for the same event at close times are searched on the time axis of the program data, so as to match the first reference event with the third reference events.
In a specific implementation, a commentator generally refers to keywords related to an event after the event occurs, and there is a certain error in identifying the event. Therefore, the first time point at which the first reference event is located may be determined on the time axis, and a first time period is obtained by extending a period of time (e.g., 10 seconds) both forward and backward from the first time point; the third reference events located in the first time period are then queried as the third reference events matching the first reference event.
And 1055, if the first reference event and the third reference event which are matched with each other are the same, marking the first reference event or the third reference event as an event which occurs in the football event as a target event for the program data.
The first reference event and the third reference events that match each other may be compared; if the first reference event is the same as the third reference event, the first reference event or the third reference event may be regarded as an event occurring in the football match and marked as a target event, and the target event is marked on the program data at the time when the first reference event or the third reference event occurred.
Further, since there may be multiple third reference events, when the first reference event is compared with them, the first reference event being the same as at least one third reference event may suffice for the first reference event or the third reference event to be considered an event occurring in the football match.
Of course, to ensure the accuracy of the target event, additional conditions may be imposed on the first reference event and the third reference event being the same, for example requiring that the number or proportion of third reference events identical to the first reference event exceed a preset threshold; the first reference event or the third reference event is considered an event occurring in the football match only when the conditions are met, which is not limited in this embodiment.
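The time-window matching of steps 1054-1055, including the optional count threshold discussed above, might look like this (the 10-second window follows the example in the text; the threshold default is illustrative):

```python
def match_third_events(first_time, third_events, window=10.0):
    """Step 1054: collect third reference events whose time point falls in
    [first_time - window, first_time + window]."""
    return [(t, name) for t, name in third_events
            if first_time - window <= t <= first_time + window]

def confirm_with_third(first_event, first_time, third_events, min_matching=1):
    """Step 1055: confirm the first reference event when enough matching
    third reference events carry the same event name."""
    same = [name for _, name in match_third_events(first_time, third_events)
            if name == first_event]
    return first_event if len(same) >= min_matching else None
```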
Step 1056, if the category of the first reference event is the fourth target category, then search for a second reference event matching the first reference event in time.
If the category of the first reference event is a fourth target category, for example a yellow card, a red card, or a substitution, the confidence of the first reference event is moderate, and the second reference event also has a certain confidence for the fourth target category. Therefore, the first reference event and the second reference event may be combined to confirm the event occurring in the football match; at this time, the first reference event and the second reference events predicted for the same event at close times are searched on the time axis of the program data, so as to match the first reference event with the second reference events.
In a specific implementation, interface elements such as text and icons appear within a period of time (varying from 0 to 100 seconds) after the event occurs. Therefore, the first time point at which the first reference event is located may be determined on the time axis, and a second time period is obtained by extending a period of time (for example, 100 seconds) after the first time point; the second reference events located in the second time period are then queried as the second reference events matching the first reference event.
Step 1057, if the matched first reference event is the same as the second reference event, marking the first reference event or the second reference event as an event occurring in the football game as a target event for the program data.
The first reference event and the second reference events that match each other may be compared; if the first reference event is the same as the second reference event, the first reference event or the second reference event may be regarded as an event occurring in the football match and marked as a target event, and the target event is marked on the program data at the time when the first reference event or the second reference event occurred.
Further, since there may be multiple second reference events, when the first reference event is compared with them, the first reference event being the same as at least one second reference event may suffice for the first reference event or the second reference event to be considered an event occurring in the football match.
Of course, to ensure the accuracy of the target event, additional conditions may be imposed on the first reference event and the second reference event being the same, for example requiring that the number or proportion of second reference events identical to the first reference event exceed a preset threshold; the first reference event or the second reference event is considered an event occurring in the football match only when the conditions are met, which is not limited in this embodiment.
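Steps 1056-1057 differ from the previous matching only in that the search window extends forward from the first event (the icons appear after the play); a proportion threshold is used here for illustration:

```python
def match_second_events(first_time, second_events, window=100.0):
    """Step 1056: icons appear after the play, so only events in
    [first_time, first_time + window] are candidates."""
    return [(t, name) for t, name in second_events
            if first_time <= t <= first_time + window]

def confirm_with_second(first_event, first_time, second_events, min_ratio=0.5):
    """Step 1057: confirm when the proportion of matching second reference
    events with the same name reaches the threshold."""
    matched = match_second_events(first_time, second_events)
    if not matched:
        return None
    same = sum(1 for _, name in matched if name == first_event)
    return first_event if same / len(matched) >= min_ratio else None
```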
In this embodiment, program data of a football match is obtained, where the program data includes original video data and original audio data; an event occurring in the football match is identified according to the features of the original video data and the features of the original audio data, as a first reference event; an event occurring in the football match is identified according to elements displayed in the original video data, as a second reference event; an event occurring in the football match is identified according to the language expressed by the original audio data, as a third reference event; and the event occurring in the football match is marked on the program data as the target event according to at least one of the first reference event, the second reference event and the third reference event. This embodiment uses the original video data and original audio data, alone or in combination, to predict the events that may occur in the football match, making full use of existing material and controlling its cost. Since different technologies differ in quality, the events they predict are used alone or fused according to the situation to finally confirm the event occurring in the football match, which gives full play to the advantages of the different technologies, suppresses their disadvantages, and improves the accuracy of detecting the events occurring in the football match.
Example two
Fig. 4 is a flowchart of a football event detection method according to a second embodiment of the present invention, where operations of extracting highlight content and pushing it to a live broadcast are added. As shown in fig. 4, the method includes:
step 401, obtaining program data of a football game.
The program data includes original video data and original audio data.
Step 402, identifying an event occurring in the football game as a first reference event according to the characteristics of the original video data and the characteristics of the original audio data.
Step 403, identifying an event occurring in the soccer event according to the elements displayed in the original video data as a second reference event.
Step 404, identifying an event occurring in the football event according to the language expressed by the original audio data as a third reference event.
Step 405, marking the event occurring in the football game as the target event according to the program data in at least one of the first reference event, the second reference event and the third reference event.
And step 406, extracting partial program data containing the target event as an event segment.
In this embodiment, the time point of the target event may be used as a base point and extended forward and backward by a certain period of time to obtain a time period; the program data within this time period is extracted so that the context of the event is preserved, forming an event segment (also referred to as a highlight clip, highlight moment, and the like).
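Step 406 can be sketched as computing a clip window around the target event, clamped to the program's duration (the 15-second margins are illustrative, not values specified by the embodiment):

```python
def event_clip_bounds(event_time, duration, before=15.0, after=15.0):
    """Step 406: extend forward and backward from the target event's time
    point, clamped to the program's length, so the clip keeps context."""
    start = max(0.0, event_time - before)
    end = min(duration, event_time + after)
    return start, end
```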
Step 407, add a title to the event clip with reference to the target event.
In this embodiment, natural language processing is performed on the target event to generate a title for the event clip, so that the title reflects the target event.
And step 408, if the program data is released in the live broadcast room in a live broadcast manner, pushing the event segments and the titles to the live broadcast room for displaying.
The program data is released in a live broadcast manner in a live broadcast room: a user logs in to a client and loads the page of the live broadcast room to play the program data. At this time, the event segment and the title may be pushed to the live broadcast room and displayed in the page, where the event segment is displayed in the form of a cover; if the user clicks the event segment or the title in the client, the client requests the event segment and plays it.
In this embodiment, new highlight content can be produced automatically based on the program data, guaranteeing the speed of content production and quickly pushing the content to users in a live scene, which improves the diversity and timeliness of the content and thereby the users' viewing experience.
EXAMPLE III
Fig. 5 is a flowchart of a football event detection method according to a third embodiment of the present invention, where an operation of marking target events on the playback progress bar is added. As shown in fig. 5, the method includes:
step 501, program data with the content of football match is obtained.
The program data includes original video data and original audio data.
Step 502, identifying an event occurring in the football game as a first reference event according to the characteristics of the original video data and the characteristics of the original audio data.
Step 503, identifying an event occurring in the soccer event according to the elements displayed in the original video data, as a second reference event.
Step 504, identifying an event occurring in the football event according to the language expressed by the original audio data, and using the event as a third reference event.
Step 505, according to at least one of the first reference event, the second reference event and the third reference event, marking the event occurring in the football game as the target event for the program data.
Step 506, querying the program data for a second time point at which the target event occurs.
And step 507, notifying the client of the target event and the second time point.
In this embodiment, the second time points at which some or all of the target events occur may be queried on the time axis of the program data. If the client requests to play the program data, the target events and the second time points may be notified to the client, where the client is configured to locate the positions indicating the second time points on the progress bar and display icons identifying the target events at those positions. The user can thus see which target event occurs at which position, drag to that position on the progress bar, and quickly browse the segments of interest.
For example, if the event is a goal, an icon whose content is a soccer ball may be displayed, if the event is a yellow card, an icon whose content is a yellow card may be displayed, and so on.
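Steps 506-507 on the client side amount to mapping each second time point to a progress-bar position and choosing an icon for the event type; a sketch with illustrative icon names (the actual assets depend on the client):

```python
# Illustrative icon names; real asset paths are an assumption here.
EVENT_ICONS = {"goal": "football.png", "yellow card": "yellow_card.png"}

def progress_marker(event, event_time, duration, bar_width):
    """Map the second time point to a pixel position on the progress bar
    and pick the icon identifying the target event (steps 506-507)."""
    x = round(event_time / duration * bar_width)
    return x, EVENT_ICONS.get(event, "event.png")
```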
In this embodiment, highlight content can be automatically marked on the timeline based on the program data, which facilitates user operation and improves the users' viewing experience.
Example four
Fig. 6 is a schematic structural diagram of a football event detection device according to a fourth embodiment of the present invention. As shown in fig. 6, the apparatus includes:
a program data obtaining module 601, configured to obtain program data of a football game, where the program data includes original video data and original audio data;
a first reference event identification module 602, configured to identify an event occurring in the soccer event according to the features of the original video data and the features of the original audio data, as a first reference event;
a second reference event recognition module 603, configured to recognize, as a second reference event, an event occurring in the soccer event according to an element displayed in the original video data;
a third reference event recognition module 604, configured to recognize, according to the language expressed by the original audio data, an event occurring in the soccer event as a third reference event;
a target event marking module 605, configured to mark, as a target event, an event occurring in the football event for the program data according to at least one of the first reference event, the second reference event, and the third reference event.
In one embodiment of the present invention, the first reference event identification module 602 includes:
the video feature extraction module is used for extracting at least two features from the original video data as video features;
the audio characteristic extraction module is used for extracting characteristics from the original audio data to serve as audio characteristics;
the live broadcast feature merging module is used for merging at least two video features and the audio features into live broadcast features;
a probability calculation module for calculating the probability of each event occurring in the football event based on the live broadcast characteristics;
and the probability determining module is used for determining the event occurring in the football game as a first reference event according to the probability.
In one embodiment of the invention, the video features comprise a first motion feature, a second motion feature, a timing feature;
the video feature extraction module comprises:
the video network determining module is used for determining a temporal segment network TSN, a temporal shift module TSM and a video transformer network VTN;
the first action characteristic extraction module is used for inputting the original video data into the temporal segment network TSN to extract characteristics representing actions as first action characteristics;
the second action characteristic extraction module is used for inputting the original video data into the temporal shift module TSM to extract characteristics representing actions as second action characteristics;
the time sequence feature extraction module is used for inputting the original video data into the video transformer network VTN to extract characteristics representing timing as time sequence features;
the audio characteristic extraction module comprises:
the audio network determining module is used for determining a convolutional neural network VGGish;
the audio network processing module is used for inputting the original audio data into the convolutional neural network VGGish to extract characteristics as audio characteristics;
the live broadcast feature merging module comprises:
a feature alignment module to temporally align the first action feature, the second action feature, the timing feature with the audio feature;
and the alignment and combination module is used for splicing the aligned first action characteristic, the aligned second action characteristic, the aligned time sequence characteristic and the aligned audio characteristic into a live broadcast characteristic.
In one embodiment of the present invention, the video network determination module comprises:
the system comprises a data set acquisition module, a data set processing module and a data set processing module, wherein the data set acquisition module is used for acquiring a first data set, a second data set, a third data set and a fourth data set, a first sample video data in the first data set, a second sample video data in the second data set and a third sample video data in the third data set are all marked with actions, and a fourth sample video data in the fourth data set is marked with events of a football match;
a video clip extracting module, configured to extract a video clip containing the event from the fourth sample video data;
the first pre-training module is used for performing first pre-training on the temporal segment network TSN constructed on the basis of a backbone network of a first residual error network by using the first data set;
the first fine tuning module is used for fine tuning the temporal segment network TSN by using the video clip if the first pre-training is finished;
the second pre-training module is used for performing second pre-training on the temporal shift module TSM constructed based on the backbone network of a second residual error network by using the second data set;
a second fine tuning module, configured to, if the second pre-training is completed, fine tune the temporal shift module TSM using the video segment;
a third pre-training module, configured to perform third pre-training on the video transformer network VTN using the third data set;
and the third fine tuning module is used for fine tuning the video transformer network VTN by using the video segment if the third pre-training is finished.
In one embodiment of the present invention, the probability calculation module includes:
the encoding layer determining module is used for determining a plurality of encoding layer transformers;
the coding layer processing module is used for sequentially inputting the target characteristics into a plurality of coding layer transformers for coding so as to output coding characteristics;
and the activation module is used for activating the coding features to obtain the probability of each event in the football game.
In one embodiment of the invention, the probability determination module comprises:
the candidate probability extraction module is used for extracting the probability which is greater than or equal to a preset probability threshold value as a candidate probability;
the target probability obtaining module is used for carrying out non-maximum suppression on the candidate probability to obtain a target probability;
and the target probability determining module is used for determining an event corresponding to the target probability of the football match as a first reference event.
In one embodiment of the present invention, the second reference event recognition module 603 includes:
the first image area extraction module is used for extracting pixel points in a specified first range from the original video data to obtain a first image area;
the optical character recognition module is used for carrying out optical character recognition on the first image area to obtain first text information;
the first reference word matching module is used for matching the first text information with a preset first reference word, wherein the first reference word is a keyword for expressing an event occurring in the football match;
and the first reference word determining module is used for determining that an event corresponding to the first reference word occurs in the football match as a second reference event if the matching is successful.
In one embodiment of the present invention, the first image region extraction module includes:
the server inquiry module is used for inquiring a server providing the program data;
the first range query module is used for querying a first range set for the service provider, and the first range is used for displaying characters related to the event of the football event;
and the first range extraction module is used for extracting pixel points positioned in the first range from the original video data to obtain a first image area.
In one embodiment of the present invention, the first reference word matching module includes:
the editing distance calculation module is used for calculating the editing distance between the first text message and a preset first reference word;
and the first matching success determining module is used for determining that the first text information is successfully matched with the first reference word if the editing distance is smaller than or equal to a preset distance threshold.
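The edit-distance matching used by these two modules can be sketched with a standard Levenshtein dynamic program (the distance threshold of 1 is illustrative; the embodiment leaves the threshold preset):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def first_word_matches(text: str, reference: str, threshold: int = 1) -> bool:
    """Matching succeeds when the edit distance is within the threshold."""
    return edit_distance(text, reference) <= threshold
```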
In another embodiment of the present invention, the second reference event recognition module 603 includes:
the second image area extraction module is used for extracting pixel points in a specified second range from the original video data to obtain a second image area;
a reference image block matching module, configured to match the second image area with a preset reference image block, where the reference image block is an icon used by an event occurring in the football event;
and the reference image block determining module is used for determining that an event corresponding to the reference image block occurs in the football event as a second reference event if the matching is successful.
In one embodiment of the present invention, the second image region extraction module includes:
the server inquiry module is used for inquiring a server providing the program data;
a second range query module, configured to query a second range set for the service provider, where the second range is used to display an icon used by an event occurring with the football match;
and the second range extraction module is used for extracting pixel points positioned in the second range from the original video data to obtain a second image area.
In one embodiment of the present invention, the reference image block matching module includes:
the cross correlation coefficient calculation module is used for calculating the cross correlation coefficient between the second image area and a preset reference image block;
and the second matching success determining module is used for determining that the second image area is successfully matched with the reference image block if the cross-correlation coefficient is greater than or equal to a preset correlation threshold.
In one embodiment of the present invention, the third reference event identification module 604 comprises:
the voice recognition module is used for carrying out voice recognition on the original audio data to obtain second text information representing sentences;
the second reference word comparison module is used for comparing the second text information with a preset second reference word, wherein the second reference word is a keyword for expressing an event occurring in the football match;
and the second reference word determining module is configured to determine that an event corresponding to the second reference word occurs in the football event as a third reference event if the second text information includes the second reference word.
In an embodiment of the present invention, the third reference event recognition module 604 further includes:
the third reference word comparison module is used for comparing the second text information with a preset third reference word, wherein the third reference word is a keyword for expressing fuzzy semantics;
a third reference word determining module, configured to determine that the third reference event does not occur in the football event if the second text information includes the third reference word.
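In one possible implementation, the second/third reference word comparison above could be sketched as follows (illustrative Python; every keyword shown is an assumption for the sketch, not a disclosed reference word):

```python
def detect_event_from_transcript(sentence, event_keywords,
                                 fuzzy_keywords=("maybe", "almost", "nearly")):
    """Map an ASR transcript sentence to an event label.
    A hit on a fuzzy-semantics keyword (third reference word) vetoes the
    detection, so ambiguous commentary yields no third reference event."""
    text = sentence.lower()
    if any(k in text for k in fuzzy_keywords):
        return None  # ambiguous commentary: do not emit a third reference event
    for event, keywords in event_keywords.items():
        if any(k in text for k in keywords):
            return event
    return None
```

The veto step implements the module above: commentary such as "almost a goal" contains an event keyword but describes a near miss, so it must not be reported as a third reference event.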
In one embodiment of the present invention, the target event marking module 605 includes:
a category classification module for classifying the first reference event or the third reference event;
a first target determination module, configured to mark, in the program data, the first reference event as a target event occurring in the football match if the category of the first reference event is a first target category;
a second target determination module, configured to mark, in the program data, the third reference event as a target event occurring in the football match if the category of the third reference event is a second target category;
a first matching event searching module, configured to search, if the category of the first reference event is a third target category, for the third reference event matching the first reference event in time;
a third target determination module, configured to mark, in the program data, the first reference event or the third reference event as a target event occurring in the football match if the first reference event and the third reference event that are matched with each other are the same;
a second matching event searching module, configured to search for the second reference event matching the first reference event in time if the category of the first reference event is a fourth target category;
a fourth target determination module, configured to mark, in the program data, the first reference event or the second reference event as a target event occurring in the football match if the first reference event and the second reference event that are matched with each other are the same.
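In one possible implementation, the per-category fusion performed by the target event marking module 605 could be sketched as follows (illustrative Python; the `Event` record, the 10-second matching window and the numeric category ids are assumptions for the sketch, not part of the disclosure):

```python
from dataclasses import dataclass

@dataclass
class Event:
    label: str      # e.g. "goal", "corner"
    time: float     # seconds into the program data
    category: int   # hypothetical category id assigned by a classifier

def find_match(event, candidates, window=10.0):
    """Return a candidate event lying within `window` seconds of `event`."""
    for c in candidates:
        if abs(c.time - event.time) <= window:
            return c
    return None

def mark_target_events(first_events, second_events, third_events):
    """Fuse the three reference-event streams into target events:
    category 1 -> accept the first reference event directly;
    category 2 -> accept the third reference event directly;
    category 3 -> accept only if a matching third reference event agrees;
    category 4 -> accept only if a matching second reference event agrees."""
    targets = [e for e in third_events if e.category == 2]
    for e in first_events:
        if e.category == 1:
            targets.append(e)
        elif e.category == 3:
            m = find_match(e, third_events)
            if m and m.label == e.label:
                targets.append(e)
        elif e.category == 4:
            m = find_match(e, second_events)
            if m and m.label == e.label:
                targets.append(e)
    return targets
```

Requiring agreement between two independent detectors for the harder categories is what lets one modality compensate for the weaknesses of another.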
In one embodiment of the present invention, the first matching event search module includes:
a first time point determination module for determining a first time point in time at which the first reference event is located;
the first time period extending module is used for extending both forwards and backwards from the first time point to obtain a first time period;
a first time period querying module, configured to query the third reference event located in the first time period as the third reference event matching the first reference event.
In one embodiment of the present invention, the second matching event searching module includes:
a first time point determination module for determining a first time point in time at which the first reference event is located;
the second time period extending module is used for extending backwards from the first time point to obtain a second time period;
and the second time period query module is used for querying the second reference event positioned in the second time period as the second reference event matched with the first reference event.
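In one possible implementation, the two time-period constructions above could be sketched as follows (illustrative Python; the 15-second defaults are assumptions, and events are represented as `(time, label)` tuples):

```python
def first_time_period(t: float, radius: float = 15.0):
    """Extend both forwards and backwards from the first time point."""
    return (t - radius, t + radius)

def second_time_period(t: float, length: float = 15.0):
    """Extend in one direction only from the first time point
    (backwards, as stated in the text) to obtain the second time period."""
    return (t - length, t)

def events_in_period(events, period):
    """Query the reference events whose time falls inside the period."""
    lo, hi = period
    return [e for e in events if lo <= e[0] <= hi]
```

Any reference event returned by `events_in_period` is treated as matching the first reference event in time.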
In one embodiment of the present invention, further comprising:
an event segment extraction module, configured to extract a part of the program data including the target event as an event segment;
a title adding module, configured to add a title to the event clip with reference to the target event;
and the live broadcast pushing module is used for pushing the event segments and the titles to a live broadcast room for display if the program data is released in the live broadcast room in a live broadcast manner.
In one embodiment of the present invention, further comprising:
the second time point query module is used for querying a second time point of the target event in the program data;
and the event notification module is used for notifying a client of the target event and the second time point, and the client is used for positioning the position representing the second time point on the progress bar and displaying an icon identifying the target event at the position.
The football event detection device provided by the embodiment of the invention can execute the football event detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the football event detection method.
EXAMPLE five
FIG. 7 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 7, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12 and a Random Access Memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from a storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data necessary for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the detection of a soccer event.
In some embodiments, the method of detecting a soccer event may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the method of detecting a soccer event described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the detection method of the soccer event by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (20)

1. A method for detecting a soccer event, comprising:
acquiring program data with the content of a football event, wherein the program data comprises original video data and original audio data;
identifying an event occurring in the football event according to the characteristics of the original video data and the characteristics of the original audio data, and using the event as a first reference event;
identifying an event occurring in the football event as a second reference event according to elements displayed in the original video data;
identifying an event occurring in the football event according to the language expressed by the original audio data as a third reference event;
marking events occurring in the football event as target events for the program data according to at least one of the first reference event, the second reference event and the third reference event.
2. The method of claim 1, wherein the identifying an event occurring in the football event as a first reference event based on the characteristics of the original video data and the characteristics of the original audio data comprises:
extracting at least two characteristics from the original video data as video characteristics;
extracting features from the original audio data as audio features;
merging at least two of the video features and the audio features into a live feature;
calculating probabilities of occurrence of respective events in the football event based on the live broadcast characteristics;
and determining an event occurring in the football event according to the probability as a first reference event.
3. The method of claim 2, wherein the video features comprise a first motion feature, a second motion feature, a timing feature;
the extracting at least two features from the original video data as video features comprises:
determining a temporal segment network (TSN), a temporal shift module (TSM) and a video transformer network (VTN);
inputting the original video data into the temporal segment network TSN to extract features representing actions as first action features;
inputting the original video data into the temporal shift module TSM to extract features representing actions as second action features;
inputting the original video data into the video transformer network VTN to extract features in the time dimension as timing features;
the extracting features from the original audio data as audio features includes:
determining a convolutional neural network VGGish;
inputting the original audio data into the convolutional neural network VGGish to extract characteristics as audio characteristics;
said merging at least two of the video features and the audio features into a live feature comprises:
temporally aligning the first motion feature, the second motion feature, the timing feature with the audio feature;
splicing the aligned first action feature, second action feature, timing feature, and audio feature into a live feature.
4. The method according to claim 3, wherein the determining the temporal segment network (TSN), the temporal shift module (TSM), and the video transformer network (VTN) comprises:
acquiring a first data set, a second data set, a third data set and a fourth data set, wherein first sample video data in the first data set, second sample video data in the second data set and third sample video data in the third data set are all marked with actions, and the content of fourth sample video data in the fourth data set is a football event and is marked with events occurring in the football event;
extracting video segments containing the events from the fourth sample video data;
performing first pre-training, using the first data set, on a temporal segment network (TSN) constructed on the backbone of a first residual network (ResNet);
if the first pre-training is finished, fine-tuning the temporal segment network TSN by using the video segments;
performing second pre-training, using the second data set, on a temporal shift module (TSM) constructed on the backbone of a second residual network;
if the second pre-training is finished, fine-tuning the temporal shift module TSM by using the video segments;
performing a third pre-training on a video transformer network (VTN) using the third data set;
and if the third pre-training is finished, fine-tuning the video transformer network VTN by using the video segments.
5. The method of claim 2, wherein calculating the probability of each event occurring in the football event based on the live features comprises:
determining a plurality of Transformer encoder layers;
sequentially inputting the target features into the plurality of Transformer encoder layers for encoding so as to output encoded features;
and applying an activation function to the encoded features to obtain the probability of each event occurring in the football match.
6. The method of claim 2, wherein said determining an event occurring in said soccer event according to said probability as a first reference event comprises:
extracting the probability which is greater than or equal to a preset probability threshold value as a candidate probability;
carrying out non-maximum suppression on the candidate probability to obtain a target probability;
and determining an event corresponding to the target probability of the football game as a first reference event.
7. The method according to any of claims 1-6, wherein said identifying an event occurring in said football event from elements displayed in said original video data as a second reference event comprises:
extracting pixel points in a specified first range from the original video data to obtain a first image area;
carrying out optical character recognition on the first image area to obtain first text information;
matching the first text information with a preset first reference word, wherein the first reference word is a keyword for expressing an event of the football match;
and if the matching is successful, determining that an event corresponding to the first reference word occurs in the football match, and using the event as a second reference event.
8. The method of claim 7, wherein extracting pixel points located in a first range specified in the original video data to obtain a first image region comprises:
inquiring a service provider providing the program data;
inquiring a first range set for the service provider, wherein the first range is used for displaying characters related to the event of the football match;
and extracting pixel points positioned in the first range from the original video data to obtain a first image area.
9. The method according to claim 7, wherein the matching the first text information with a preset first reference word comprises:
calculating the edit distance between the first text information and a preset first reference word;
and if the edit distance is smaller than or equal to a preset distance threshold, determining that the first text information is successfully matched with the first reference word.
10. The method according to any of claims 1-6, wherein said identifying an event occurring in said football event from elements displayed in said original video data as a second reference event comprises:
extracting pixel points in a specified second range from the original video data to obtain a second image area;
matching the second image area with a preset reference image block, wherein the reference image block is an icon used by an event of the football match;
and if the matching is successful, determining that an event corresponding to the reference image block occurs in the football game as a second reference event.
11. The method according to claim 10, wherein said extracting pixel points located in a second designated range from the original video data to obtain a second image region comprises:
inquiring a service provider providing the program data;
querying a second range set for the service provider, wherein the second range is used for displaying icons used by events occurring in the football match;
and extracting pixel points positioned in the second range from the original video data to obtain a second image area.
12. The method according to claim 10, wherein the matching the second image area with a preset reference image block comprises:
calculating a cross-correlation coefficient between the second image area and a preset reference image block;
and if the cross-correlation coefficient is greater than or equal to a preset correlation threshold, determining that the second image area is successfully matched with the reference image block.
13. The method according to any of claims 1-6, wherein said identifying an event occurring in said football event from the language expressed in said original audio data as a third reference event comprises:
carrying out voice recognition on the original audio data to obtain second text information representing a sentence;
comparing the second text information with a preset second reference word, wherein the second reference word is a keyword for expressing an event of the football match;
and if the second text information contains the second reference word, determining that an event corresponding to the second reference word occurs in the football game as a third reference event.
14. The method of claim 13, wherein the identifying an event occurring in the soccer event from the language expressed in the raw audio data as a third reference event further comprises:
comparing the second text information with a preset third reference word, wherein the third reference word is a keyword for expressing fuzzy semantics;
if the second text message includes the third reference word, determining that the third reference event does not occur in the football event.
15. The method of any of claims 1-6, 7-9, 11-12, and 14, wherein said marking, in the program data, an event occurring in the football match as a target event according to at least one of the first reference event, the second reference event and the third reference event comprises:
classifying the first reference event or the third reference event into categories;
if the category of the first reference event is a first target category, marking, in the program data, the first reference event as a target event occurring in the football match;
if the category of the third reference event is a second target category, marking, in the program data, the third reference event as a target event occurring in the football match;
if the category of the first reference event is a third target category, searching for the third reference event matching the first reference event in time;
if the first reference event and the third reference event that are matched with each other are the same, marking, in the program data, the first reference event or the third reference event as a target event occurring in the football match;
if the category of the first reference event is a fourth target category, searching for the second reference event matching the first reference event in time;
and if the first reference event and the second reference event that are matched with each other are the same, marking, in the program data, the first reference event or the second reference event as a target event occurring in the football match.
16. The method of claim 15,
the searching for the third reference event matching the first reference event in time includes:
determining in time a first point in time at which the first reference event is located;
simultaneously extending forwards and backwards along the first time point to obtain a first time period;
querying the third reference event located within the first time period as the third reference event matching the first reference event;
the searching for the second reference event matching the first reference event in time includes:
determining in time a first point in time at which the first reference event is located;
extending backwards from the first time point to obtain a second time period;
querying the second reference event located within the second time period as the second reference event matching the first reference event.
17. The method of any of claims 1-6, 7-9, 11-12, 14, 16, further comprising:
extracting part of the program data containing the target event as an event segment;
adding a title to the event clip with reference to the target event;
and if the program data is released in a live broadcast mode in a live broadcast room, pushing the event segments and the titles to the live broadcast room for displaying.
18. The method of any of claims 1-6, 7-9, 11-12, 14, 16, further comprising:
querying the program data for a second time point at which the target event occurs;
and informing a client of the target event and the second time point, wherein the client is used for positioning a position representing the second time point on a progress bar and displaying an icon for identifying the target event at the position.
19. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of detecting a soccer event of any one of claims 1-18.
20. A computer-readable storage medium, having stored thereon computer instructions for causing a processor to execute a method for detecting a soccer event according to any one of claims 1-18.
CN202210156457.6A 2022-02-21 2022-02-21 Football event detection method, equipment and storage medium Pending CN114612819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210156457.6A CN114612819A (en) 2022-02-21 2022-02-21 Football event detection method, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114612819A true CN114612819A (en) 2022-06-10

Family

ID=81858891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210156457.6A Pending CN114612819A (en) 2022-02-21 2022-02-21 Football event detection method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114612819A (en)

Similar Documents

Publication Publication Date Title
Merler et al. Automatic curation of sports highlights using multimodal excitement features
US10965999B2 (en) Systems and methods for multimodal multilabel tagging of video
US11581021B2 (en) Method and apparatus for locating video playing node, device and storage medium
Xu et al. Live sports event detection based on broadcast video and web-casting text
Xu et al. A novel framework for semantic annotation and personalized retrieval of sports video
CN109657100B (en) Video collection generation method and device, electronic equipment and storage medium
CN110381366B (en) Automatic event reporting method, system, server and storage medium
Xu et al. Fusion of AV features and external information sources for event detection in team sports video
CN110650374B (en) Clipping method, electronic device, and computer-readable storage medium
WO2007073349A1 (en) Method and system for event detection in a video stream
CN113613065B (en) Video editing method and device, electronic equipment and storage medium
KR100764175B1 (en) Apparatus and Method for Detecting Key Caption in Moving Picture for Customized Service
KR20190011829A (en) Estimating and displaying social interest in time-based media
Dumont et al. Automatic story segmentation for tv news video using multiple modalities
Wang et al. Soccer video event annotation by synchronization of attack–defense clips and match reports with coarse-grained time information
Merler et al. Automatic curation of golf highlights using multimodal excitement features
Javed et al. Multimodal framework based on audio‐visual features for summarisation of cricket videos
Kijak et al. Temporal structure analysis of broadcast tennis video using hidden Markov models
CN114612819A (en) Football event detection method, equipment and storage medium
CN116017088A (en) Video subtitle processing method, device, electronic equipment and storage medium
CN115080792A (en) Video association method and device, electronic equipment and storage medium
Miyauchi et al. Collaborative multimedia analysis for detecting semantical events from broadcasted sports video
Jung et al. Player information extraction for semantic annotation in golf videos
CN113537052B (en) Video clip extraction method, device, equipment and storage medium
Yu et al. Snooker video event detection using multimodal features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination