Disclosure of Invention
The technical problem addressed by the invention is how to analyze and evaluate the match performance of football players using machine learning. To this end, the invention provides a method for evaluating the match performance of football players based on emotion calculation.
The innovation of the method is as follows: expressions relating to player performance in football match reports are structured and quantified using emotion calculation and text information extraction techniques, combined with statistical data, and passed to a linear regression algorithm that outputs a player performance score. By combining the players' statistical information with the text of the match report, the method can evaluate player performance more scientifically and reasonably, as shown in Fig. 1.
To achieve this purpose, the invention adopts the following technical solution:
A method for evaluating the match performance of football players based on emotion calculation comprises the following steps:
Step 1: Train the emotion calculation model. As shown in Fig. 2, this comprises the following steps:
Step 1.1: Collect a number of relevant sports news articles as a corpus and, using CBOW as the network, train a word2vec model that outputs 100-dimensional word vectors, i.e., a model that converts each Chinese word into a 100-dimensional vector.
Step 1.2: Obtain the post-match reports written by football commentators for a number of football matches (e.g., all matches of the 2018 Chinese Super League season).
Step 1.3: Divide each match report into sentences.
Step 1.4: Segment each sentence into words, convert each word into a 100-dimensional word vector using the word2vec model trained in step 1.1, and thereby convert the sentence into a vector sequence.
Step 1.5: Invite annotators to classify the emotional tendency of each sentence into one of four categories: positive, neutral, negative, or irrelevant to player performance. Then discard the sentences that are neutral or irrelevant to player performance.
Step 1.6: To make the text emotion calculation model more robust and to address the small data volume and the imbalance between positive and negative samples, augment the positive and negative sentences retained in step 1.5 by shuffling. That is, each sentence is first segmented into words, and the order of the words is then randomly changed.
Step 1.7: Taking the word vector sequence of each sentence as input and the emotion score of the sentence as the output label, train a long short-term memory (LSTM) network to obtain the emotion calculation model.
Cross entropy is selected as the loss function, parameters are updated with the Adam optimizer, and the model is preferably trained for 128 epochs. The LSTM is computed as follows:
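The shuffling augmentation can be illustrated with a minimal Python sketch. The function name `augment_by_shuffling`, the `n_copies` and `seed` parameters, and the English sample tokens are assumptions made for illustration; the source only states that the segmented words are randomly reordered.

```python
import random

def augment_by_shuffling(tokens, n_copies, seed=None):
    """Return n_copies randomly reordered variants of a segmented sentence."""
    rng = random.Random(seed)      # seeded generator (assumption, for reproducibility)
    variants = []
    for _ in range(n_copies):
        shuffled = list(tokens)    # copy so the original word order is preserved
        rng.shuffle(shuffled)      # in-place random permutation of the copy
        variants.append(shuffled)
    return variants

# Hypothetical, already segmented sentence (English tokens for readability).
tokens = ["the", "striker", "missed", "an", "open", "goal"]
augmented = augment_by_shuffling(tokens, n_copies=3, seed=0)
```

Each variant contains exactly the same words as the original sentence and keeps the original sentence's emotion label; generating more variants for the minority class is one way to balance the two classes.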
Written in the standard LSTM form, the gates are:
$\Gamma_f = \sigma(W_f [h^{<t-1>}, x^{<t>}] + b_f)$
$\Gamma_u = \sigma(W_u [h^{<t-1>}, x^{<t>}] + b_u)$
$\Gamma_o = \sigma(W_o [h^{<t-1>}, x^{<t>}] + b_o)$
Information added to the current state:
$\tilde{c}^{<t>} = \tanh(W_c [h^{<t-1>}, x^{<t>}] + b_c)$
Updated information of the current state:
$c^{<t>} = \Gamma_u \odot \tilde{c}^{<t>} + \Gamma_f \odot c^{<t-1>}$
Output information of the current state:
$h^{<t>} = \Gamma_o \odot \tanh(c^{<t>})$
Here $c^{<t>}$ is the cell state at time t, which records the information stored by the network up to time t; $\tilde{c}^{<t>}$ is the information added to the state at time t; $x^{<t>}$ is the network input at time t; h denotes the output value of the LSTM network, $h^{<t>}$ being the output value at time t and $h^{<t-1>}$ the output value at time t-1; $\sigma$ is an activation function, typically sigmoid or tanh. $W_f$, $W_u$, $W_o$, $W_c$ are parameter matrices and $b_f$, $b_u$, $b_o$, $b_c$ are parameter vectors, all obtained by training with gradient descent.
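One time step of the LSTM cell described above can be sketched in plain Python. The gate layout (forget, update, and output gates with a tanh candidate state) follows the standard LSTM formulation and is an assumption wherever the text does not spell it out; the helper names `affine` and `lstm_step` and the all-zero toy parameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step using W_f, W_u, W_o, W_c and b_f, b_u, b_o, b_c."""
    z = h_prev + x_t  # list concatenation: [h^{<t-1>}, x^{<t>}]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]  # forget gate
    u = [sigmoid(a) for a in affine(params["W_u"], params["b_u"], z)]  # update gate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]  # output gate
    c_tilde = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]
    # New cell state: keep part of the old state, add part of the candidate.
    c_t = [fi * cp + ui * ct for fi, cp, ui, ct in zip(f, c_prev, u, c_tilde)]
    h_t = [oi * math.tanh(ci) for oi, ci in zip(o, c_t)]
    return h_t, c_t

# Toy configuration: hidden size 2, input size 3, all parameters zero.
hidden, inputs = 2, 3
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zeros for name in ("W_f", "W_u", "W_o", "W_c")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_u", "b_o", "b_c")})
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], params)
```

With all parameters at zero every gate evaluates to 0.5 and the candidate state to 0, so the new cell state is simply half of the previous one. In practice the parameters would be learned, and a 100-dimensional word vector would be fed in at each time step.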
FIG. 3 shows the specific structure of LSTM, and FIG. 4 shows the complete structure of the emotion calculation model based on LSTM neural network.
Step 2: Extract the text information. As shown in Fig. 5, this comprises the following steps:
Step 2.1: Divide the match report of the current match into sentences.
Step 2.2: After Chinese word segmentation and part-of-speech tagging of a sentence, extract and match the player names and event names (e.g., goal) in the sentence using a rule-based text extraction technique, obtaining a pair consisting of a player name and an event name.
Step 2.3: Segment and vectorize the sentence, converting it into a word vector sequence using the word2vec model trained in step 1.1.
Step 2.4: Input the word vector sequence into the emotion calculation model trained in step 1.7, which outputs the corresponding emotion score, a value between -1 and 1, where -1 represents extremely negative and 1 represents extremely positive. Combine the score with the pair obtained in step 2.2 to obtain a triple of player name, event, and emotion score (e.g., Messi-shot-0.87).
Step 3: Train the player performance evaluation model and output the player performance score. As shown in Fig. 6, this comprises the following steps:
Step 3.1: Obtain the technical statistics of players from a number of football matches (e.g., all matches of a league over 3 to 4 seasons) together with third-party scores for the players (e.g., the scores published by an authoritative football data website).
Step 3.2: Divide the players into goalkeepers and non-goalkeepers. The goalkeeper technical statistics include the following required technical dimensions: playing time, errors leading to goals, number of saves, number of yellow cards, number of red cards, number of plays, number of passes, pass success rate, and long-pass success rate. Table 1 gives a set of goalkeeper technical statistics items comprising 22 technical dimensions.
Table 1: Goalkeeper technical statistics items
The technical statistics for non-goalkeepers include the following required technical dimensions: playing time, number of goals, number of own goals, number of assists, number of yellow cards, number of red cards, number of shots, number of shots on target, number of passes, pass success rate, number of key passes, aerial-duel success rate, number of dribbles, number of fouls, number of tackles, tackle success rate, number of defensive actions, cross success rate, etc. Table 2 gives a set of non-goalkeeper technical statistics items comprising 36 technical dimensions.
Table 2: Non-goalkeeper technical statistics items
A player score is a value between 0 and 10 and may be given to one decimal place (e.g., 7.1).
Step 3.3: Assign the technical statistics and scores of players from one part of the matches to the training set, and those from the remaining matches (e.g., the matches of one season) to the test set.
Step 3.4: Train separate linear regression models for goalkeepers and non-goalkeepers.
The specific formulas are as follows:
Linear regression equation: $f(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m$
Data set: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_i, y_i), \dots, (x_n, y_n)\}$
In the regression equation, $x_1, x_2, \dots, x_m$ are the values of the individual dimensions of a player's technical statistics and m is the number of feature dimensions; $w_1, w_2, \dots, w_m$ are the weights of the corresponding features; $w_0$ is the intercept. In the data set, n is the number of samples and $(x_i, y_i)$ are the technical statistics and the performance score of player i. $J(W)$ is the cost function, which measures the difference between the fitted result $f(X)$ given by the linear regression equation and the true value $Y$. The weights and the intercept are obtained by minimizing the cost function, which yields the final linear regression model.
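The text does not write the cost function out. Assuming the usual least-squares criterion (an assumption, since the scaling and exact form are not given in the source), it could be written as:

```latex
J(W) = \frac{1}{2n} \sum_{i=1}^{n} \bigl( f(x_i) - y_i \bigr)^2,
\qquad
W^{*} = \arg\min_{W} J(W),
\qquad
W = (w_0, w_1, \dots, w_m)
```

Here $f(x_i)$ is the regression output for the technical statistics of sample i and $y_i$ is the corresponding third-party score, so $J(W)$ is small exactly when the fitted scores are close to the true ones.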
Step 3.5: Match the player name and event name in each triple obtained in step 2.4 against the player name and statistics item name in the technical statistics, and add the emotion score of the player's event to the value of the corresponding statistics item, obtaining the player's technical statistics combined with the text information. For example, if Messi's shot count for the whole match is 3 and the match report yields the triple Messi-shot-0.87 for a Messi shot, the final shot count for Messi is 3.87.
Step 3.6: Feed each player's statistics into the goalkeeper or non-goalkeeper model trained in step 3.4, as appropriate, to obtain the final player performance score.
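The routing of a player to the goalkeeper or non-goalkeeper model can be sketched as follows. The dictionary layout, the `is_goalkeeper` flag, and the two-feature toy weights are hypothetical and chosen only to keep the example small; real models would use the 22 or 36 technical dimensions.

```python
def predict_score(weights, intercept, features):
    """Evaluate f(x) = w0 + w1*x1 + ... + wm*xm for one player."""
    return intercept + sum(w * x for w, x in zip(weights, features))

def score_player(player, models):
    """Select the model matching the player's category and score the player."""
    category = "goalkeeper" if player["is_goalkeeper"] else "non_goalkeeper"
    weights, intercept = models[category]
    return predict_score(weights, intercept, player["features"])

# Hypothetical trained models: (weights, intercept) per player category.
models = {
    "goalkeeper": ([0.5, 0.25], 6.0),
    "non_goalkeeper": ([1.0, 0.5], 6.0),
}
keeper = {"is_goalkeeper": True, "features": [2.0, 4.0]}
forward = {"is_goalkeeper": False, "features": [1.0, 1.0]}

keeper_score = score_player(keeper, models)    # 6.0 + 0.5*2.0 + 0.25*4.0
forward_score = score_player(forward, models)  # 6.0 + 1.0*1.0 + 0.5*1.0
```

Keeping one model per category means the feature vectors of the two groups never need to share a layout.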
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
1. Existing methods rely only on technical statistics, which reflect how many times a player performed a given event in a match but not how well (for example, a brilliant pass and a poor pass are both counted as one pass). On top of the player technical statistics, the present method introduces the text of the match report and uses emotion calculation to quantify the quality of events, so that both the quantity and the quality of each technical item are taken into account and the match performance of players can be judged more scientifically and reasonably.
2. With this method, long-term monitoring of players' match performance allows players to be compared both horizontally and longitudinally, which helps team management, coaches, and players achieve better results.
Examples
The invention consists of three parts: the emotion calculation model, whose structure is shown in Fig. 2; text information extraction, whose flowchart is shown in Fig. 5; and the player performance evaluation model, whose flowchart is shown in Fig. 6.
Step 1: Train the emotion calculation model.
Step 1.1: Train the word vector model. The sports texts in THUCNews, 131604 texts in total, are used as the corpus. Using CBOW as the network, a word2vec model that outputs 100-dimensional word vectors is trained, i.e., each Chinese word is converted into a 100-dimensional vector.
Step 1.2: Obtain the football match report texts.
A total of 480 post-match reports, published by Sina Sports and Sohu Sports for each match of the 2018 Chinese Super League season, are crawled. Each match report is divided into sentences, using commas, periods, exclamation marks, and question marks as delimiters.
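Splitting a report on these four punctuation marks can be sketched with a regular expression. Covering both the full-width Chinese marks and their ASCII counterparts is an assumption made so the example also works on English text; the function name and the sample report are hypothetical.

```python
import re

# Full-width and ASCII commas, periods, exclamation marks, and question marks.
DELIMITERS = re.compile(r"[，。！？,.!?]")

def split_sentences(report):
    """Split a match report into sentences, dropping empty fragments."""
    parts = DELIMITERS.split(report)
    return [part.strip() for part in parts if part.strip()]

# Hypothetical report text.
report = "Gao Lin received the ball, turned, and scored! A fine team move."
sentences = split_sentences(report)
```

Because the ASCII period is treated as a delimiter, decimal numbers such as 3.5 in an English text would also be split; the full-width period used in Chinese text does not have this problem.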
Step 1.3: Annotate the text emotion.
People familiar with football are invited to annotate the emotion of the resulting sentences. The emotional tendency of each sentence is classified into one of four categories: positive, neutral, negative, or irrelevant to player performance. Sentences that are neutral or irrelevant to player performance are then discarded. This finally yields 6182 sentences with a positive emotional tendency and 1078 sentences with a negative emotional tendency.
Step 1.4: Augment the data.
To address the small data volume and the imbalance between positive and negative samples, the retained positive and negative sentences are augmented by shuffling. That is, each sentence is first segmented into words, and the order of the words is then randomly changed. A total of 12146 samples are obtained, of which 11146 are used as the training set (positive: 5663; negative: 5483) and 1000 as the test set (positive: 500; negative: 500). The data in the test set are unordered corpora.
Step 1.5: Vectorize the text.
The names of players and football terminology are preset so that these words are not wrongly split during word segmentation. The jieba word segmentation component is called to segment each sentence, each word is converted into a 100-dimensional word vector using the trained word2vec model, and the sentence is thereby converted into a vector sequence.
Step 1.6: Train the emotion calculation model.
Taking the word vector sequence of each sentence as input and the emotion score of the sentence as the output label, a long short-term memory (LSTM) network is trained to obtain the emotion calculation model. Cross entropy is selected as the loss function, parameters are updated with the Adam optimizer, and the model is trained for 128 epochs. Fig. 4 shows the complete structure of the emotion calculation model based on the LSTM neural network. Fig. 3 shows the structure of the LSTM, which is computed as follows:
Written in the standard LSTM form, the gates are:
$\Gamma_f = \sigma(W_f [h^{<t-1>}, x^{<t>}] + b_f)$
$\Gamma_u = \sigma(W_u [h^{<t-1>}, x^{<t>}] + b_u)$
$\Gamma_o = \sigma(W_o [h^{<t-1>}, x^{<t>}] + b_o)$
Information added to the current state:
$\tilde{c}^{<t>} = \tanh(W_c [h^{<t-1>}, x^{<t>}] + b_c)$
Updated information of the current state:
$c^{<t>} = \Gamma_u \odot \tilde{c}^{<t>} + \Gamma_f \odot c^{<t-1>}$
Output information of the current state:
$h^{<t>} = \Gamma_o \odot \tanh(c^{<t>})$
Here $c^{<t>}$ is the cell state at time t, which records the information stored by the network up to time t; $\tilde{c}^{<t>}$ is the information added to the state at time t; $x^{<t>}$ is the network input at time t; h denotes the output value of the LSTM network, $h^{<t>}$ being the output value at time t and $h^{<t-1>}$ the output value at time t-1; $\sigma$ is an activation function, typically sigmoid or tanh. $W_f$, $W_u$, $W_o$, $W_c$ are parameter matrices and $b_f$, $b_u$, $b_o$, $b_c$ are parameter vectors, all obtained by training with gradient descent.
Step 2: Extract the text information.
Step 2.1: Obtain and split the text.
The match report of the current match is obtained and divided into sentences.
Step 2.2: Extract the text relations.
A pair consisting of a player name and an event name is obtained using a rule-based text information extraction technique. Specifically, the squad lists of both teams in the current match and the preset standard event names (e.g., goal, pass) are loaded first. Word segmentation and part-of-speech tagging are then performed with the jieba word segmentation component. Finally, the player names and event names are extracted and matched by the rule-based text information extraction technique, yielding a pair consisting of a player name and an event name (e.g., Messi-shot).
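A deliberately simplified sketch of such rule-based matching is given below. The single rule used here (pair each listed player with the first standard event name that follows the player's name in the same sentence) is an assumption for illustration, since the source does not disclose its actual rules; part-of-speech tags are ignored, and the tokens are assumed to be already segmented.

```python
def extract_pairs(tokens, player_names, event_names):
    """Return (player, event) pairs found in one segmented sentence."""
    pairs = []
    for index, token in enumerate(tokens):
        if token not in player_names:
            continue
        # Hypothetical rule: take the first event name after the player name.
        for later in tokens[index + 1:]:
            if later in event_names:
                pairs.append((token, later))
                break
    return pairs

# Hypothetical squad list and standard event names for the current match.
player_names = {"Gao Lin", "Messi"}
event_names = {"goal", "pass", "shot"}

tokens = ["Gao Lin", "scored", "a", "fine", "goal"]
pairs = extract_pairs(tokens, player_names, event_names)
```

A production rule set would also need to handle aliases, pronouns, and sentences in which the event precedes the player name.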
Step 2.3: Calculate the emotion score.
After Chinese word segmentation and part-of-speech tagging of a sentence with the jieba word segmentation component, each word is fed into the trained word2vec model and converted into a 100-dimensional word vector. The word vector sequence is input into the emotion calculation model trained in step 1.6, which outputs the corresponding emotion score (from -1 to 1, where -1 represents extremely negative and 1 represents extremely positive). Combining the score with the pair obtained by text relation extraction gives a triple of player name, event, and emotion score (e.g., Gao Lin-goal-0.87).
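Combining the extracted pairs with a sentence's emotion score can be sketched as below. The function name `build_triples` is hypothetical, and the score is assumed to have been produced already by the trained emotion calculation model.

```python
def build_triples(pairs, emotion_score):
    """Attach one sentence-level emotion score to every (player, event) pair."""
    if not -1.0 <= emotion_score <= 1.0:
        raise ValueError("emotion score must lie between -1 and 1")
    return [(player, event, emotion_score) for player, event in pairs]

# Pair and score taken from the example in the text.
triples = build_triples([("Gao Lin", "goal")], 0.87)
```

The range check mirrors the stated score interval, so a model output outside it is reported immediately rather than silently distorting the statistics later.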
Step 3: Train the player performance evaluation model and output the player performance score. As shown in Fig. 6, this comprises the following steps:
Step 3.1: Acquire the data.
Using a web crawler, the technical statistics of the players in every match of the Chinese Super League from the 2016 to the 2019 season are obtained, together with the player scores given by the authoritative football data website whoscored.com. The players on the pitch are divided into goalkeepers and non-goalkeepers. The goalkeeper technical statistics comprise 22 technical dimensions, including playing time, number of putting-out, number of successful putting-out, and number of passes; the non-goalkeeper technical statistics comprise 36 technical dimensions, including playing time, number of shots, number of shots on target, number of key passes, and number of tackles. A player score is a value between 0 and 10, given to one decimal place, e.g., 7.1.
Step 3.2: The technical statistics and scores of players from 2016 to 2018 are assigned to the training set, and those from 2019 to the test set.
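The season-based split can be sketched as follows. The record layout with `season`, `player`, and `score` keys and the sample values are hypothetical.

```python
def split_by_season(records, test_seasons):
    """Put records from test_seasons into the test set, all others into training."""
    train = [r for r in records if r["season"] not in test_seasons]
    test = [r for r in records if r["season"] in test_seasons]
    return train, test

# Hypothetical per-match player records.
records = [
    {"season": 2016, "player": "A", "score": 6.8},
    {"season": 2017, "player": "B", "score": 7.1},
    {"season": 2018, "player": "C", "score": 6.4},
    {"season": 2019, "player": "D", "score": 7.6},
]
train, test = split_by_season(records, test_seasons={2019})
```

Holding out a whole season, rather than a random subset of matches, tests the model on data that is strictly later than anything it was trained on.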
Step 3.3: Train the goalkeeper and non-goalkeeper linear regression models.
The specific formulas are as follows:
Linear regression equation: $f(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m$
Data set: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_i, y_i), \dots, (x_n, y_n)\}$
In the regression equation, $x_1, x_2, \dots, x_m$ are the values of the individual dimensions of a player's technical statistics and m is the number of feature dimensions; $w_1, w_2, \dots, w_m$ are the weights of the corresponding features; $w_0$ is the intercept. In the data set, n is the number of samples and $(x_i, y_i)$ are the technical statistics and the performance score of player i. $J(W)$ is the cost function, which measures the difference between the fitted result $f(X)$ given by the linear regression equation and the true value $Y$. The weights and the intercept are obtained by minimizing the cost function, which yields the final linear regression model.
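Minimizing the cost function can be illustrated with a small batch gradient descent routine in plain Python. Both the mean-squared-error cost and the use of gradient descent for this step are assumptions, since the source only states that the cost function is minimized; the learning rate, epoch count, and toy data are likewise hypothetical.

```python
def fit_linear_regression(X, y, lr=0.1, epochs=2000):
    """Fit f(x) = w0 + w1*x1 + ... + wm*xm by batch gradient descent.

    Assumed cost: J(W) = (1 / 2n) * sum((f(x_i) - y_i) ** 2).
    """
    n, m = len(X), len(X[0])
    w0, w = 0.0, [0.0] * m
    for _ in range(epochs):
        # Residual f(x_i) - y_i for every sample under the current weights.
        errors = [w0 + sum(wj * xj for wj, xj in zip(w, row)) - target
                  for row, target in zip(X, y)]
        grad0 = sum(errors) / n
        grads = [sum(e * row[j] for e, row in zip(errors, X)) / n
                 for j in range(m)]
        w0 -= lr * grad0
        w = [wj - lr * gj for wj, gj in zip(w, grads)]
    return w0, w

# Toy data generated from y = 1 + 2 * x1 (one feature, four samples).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w0, w = fit_linear_regression(X, y)
```

On this toy data the fitted intercept and weight approach 1 and 2. With 22 or 36 real statistics of very different scales, the features would normally be standardized first, or the model solved in closed form.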
Step 3.4: Correct the technical statistics.
The player name and event name in each triple obtained in step 2.3 are matched against the player name and statistics item name in the technical statistics, and the emotion score of the player's event is added to the value of the corresponding statistics item, giving the player's technical statistics combined with the text information. For example, if Gao Lin's goal count for the whole match is 3 and the match report yields the triple Gao Lin-goal-0.87 for a Gao Lin goal, the final goal count for Gao Lin is 3.87.
Step 3.5: Output the player performance score.
The players are classified as goalkeepers or non-goalkeepers according to player category. The corrected technical statistics are fed into the corresponding model trained in step 3.3 to obtain the final player performance score.
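The correction step amounts to a keyed addition, sketched below. The nested-dictionary layout and the `passes` entry are hypothetical; the goal count of 3 and the score of 0.87 are the values from the example in the text.

```python
def adjust_statistics(stats, triples):
    """Add each triple's emotion score to the matching player's statistic."""
    # Copy the nested dictionaries so the raw statistics stay unchanged.
    adjusted = {player: dict(items) for player, items in stats.items()}
    for player, event, score in triples:
        if player in adjusted and event in adjusted[player]:
            adjusted[player][event] += score
    return adjusted

stats = {"Gao Lin": {"goal": 3, "passes": 40}}
triples = [("Gao Lin", "goal", 0.87)]
adjusted = adjust_statistics(stats, triples)
```

Triples whose player or event has no counterpart in the statistics are skipped, so unmatched text never creates new statistics items.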