Flat Track Stats Ranking Algorithm

The Details of How It Works

Transparency

We believe strongly in sharing this information. In the world of computer rankings, however, an algorithm is usually treated as valuable intellectual property, with its details carefully guarded. Presumably its owners, afraid that anyone could steal and replicate their ideas, also assume that their customers are comatose and will believe anything they say.

We believe that a hidden ranking algorithm is as good as useless. If an algorithm's inner workings are not understood on at least some general level, its output should mean nothing. Without documentation to support it, you could be looking at anything, including a simple list of our favorite teams, or worse. We invite your suspicion, and hope that this document answers the questions you have about our algorithm.

A ranking algorithm attempts to uncover hidden truth about the skill of teams. But above all, please realize the subjectivity inherent in this, and all, ranking algorithms. It is subjective in that we have chosen which factors to calculate. It is subjective in how we combine these factors and decide which are more important than others. In fact, there is only one noteworthy advantage to computer rankings: they can apply their rules completely equally and objectively to vast amounts of data, potentially far more data than a human can keep track of. We believe this advantage makes computer ranking a most worthy endeavour, as long as the medium is understood by its consumers. Please see Why This Site Was Created for more information on our intentions.

How It Works: The Short Answer

The software that calculates the Flat Track Stats rankings includes many factors in an attempt to determine the skills of roller derby teams. Roller derby in its current state is a very difficult sport to rank.
The sport is young, which makes for a very uneven playing field, with established teams performing considerably better than newer teams. At the same time, the rankings are increasingly volatile as newer teams disrupt current standings with their newly gained skills. In a given season, travel teams often play only a few bouts, so the sample set can be small. This gives us a poor reading of true skill. Even more problematic is how unequal the number of bouts played per team is. While some teams may play only a few bouts, others may play more than a dozen. Including all these teams on the same ranking scale under these conditions is a considerable challenge.

Our algorithm considers the following factors before deciding on a team's rank. They are roughly ordered by importance, with the largest factors first.
How It Works: The Long Answer

For every bout a team plays, the algorithm weighs the outcome according to many factors. The following is a description of each of those factors, and how they are combined to form a team's ranking.

Score Difference (SD)

The team's score minus the opponent's score makes up its Score Difference (SD). For a given bout, the SD provides an initial metric of performance. It is the most direct way to assess skill, and from it we derive a team's total score difference (totalSD) and average score difference (avgSD) over a given period of time. For example, an SD of 4 is a close win and an SD of -72 is a blowout loss, while a totalSD of 8 shows a team that is doing just a bit better than breaking even.

Blowout Adjustment (nSD)

The algorithm implements diminishing returns on blowout wins and losses. Winning by a healthy margin is weighted higher than a close win, but continuing to run up the score against a weaker opponent earns less and less additional weight. This transformation is based on a function derived from the logistic sigmoid. This is just an S-curve, similar to exponential falloff, except that it allows more precise control over how the falloff is shaped. Large blowout wins or losses tend toward the asymptotes of the equation. The output of this function is a normalized score difference (nSD) between -1.0 and 1.0. The values -1.0 and 1.0 are asymptotes and represent a loss or win of infinite SD. An nSD of 0.0 for a particular bout would be a tie.

Opponent Skill Seeding

The skill of an opponent is important to determine because a team could run up its avgSD by consistently playing weaker teams. The opponent's skill is determined by seeding the algorithm with historical ratings. The current FTS rankings look to the most recent FTS quarterly snapshot and take the opponent's rating from there.
This means that every quarter, when new quarterly rankings are frozen and stored, these ratings become the seeds for new bouts. Our first quarterly rankings (2007 Q4), in January 2008, derived their seeds from the 2007 Q2 WFTDA ranks and considered all bouts from 2007. We chose 2007 Q2 for the seeds because it was halfway through the year and had among the most teams in its rankings. Seeding with rankings isn't perfect, because rankings don't acknowledge the jumps in skill between sections of the list, but now that we're on a rating system, the data will start to normalize itself.

Bout Outcome Factor (BOF)

This function funnels crucial bout metrics into one comparable factor. The opponent rating (or ranking, when seeding from WFTDA data) is normalized to the scale of all ratings, with 0.0 being the worst rated and 1.0 being the best rated. The normalized rating (nOppRating) is multiplied by the nSD to fulfill the following logical statements:
After all bouts for a team have been accounted for, the mean of the per-bout values is calculated and assigned to the team as its BOF.
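As a rough illustration of how a score difference could become an nSD and then a BOF, here is a minimal Python sketch. It is not the Flat Track Stats implementation: the exact S-curve and its parameters are not given in this document, so the sketch assumes a rescaled logistic curve (written with `tanh`, which is the same shape) and a hypothetical `scale` value. The function names are also made up for illustration. The per-bout value is taken literally from the text as nOppRating multiplied by nSD, for wins and losses alike.

```python
import math
from statistics import mean


def normalized_score_difference(sd: float, scale: float = 50.0) -> float:
    """Map a raw score difference (SD) onto the open interval (-1, 1).

    Assumption: tanh(sd / (2 * scale)) equals 2 / (1 + exp(-sd / scale)) - 1,
    i.e. a logistic S-curve rescaled to pass through 0 at a tie.
    `scale` is a hypothetical shape parameter, not a published value.
    """
    return math.tanh(sd / (2.0 * scale))


def normalize_rating(rating: float, worst: float, best: float) -> float:
    """Min-max normalize an opponent rating: worst -> 0.0, best -> 1.0."""
    if best == worst:
        raise ValueError("best and worst ratings must differ")
    return (rating - worst) / (best - worst)


def bout_outcome_factor(bouts, worst: float, best: float) -> float:
    """Mean of nOppRating * nSD over a team's bouts.

    `bouts` is an iterable of (score_difference, opponent_rating) pairs.
    """
    return mean(
        normalize_rating(opp_rating, worst, best) * normalized_score_difference(sd)
        for sd, opp_rating in bouts
    )


if __name__ == "__main__":
    # Made-up season: a close win, a blowout win, and a loss.
    season = [(4, 80.0), (120, 20.0), (-30, 95.0)]
    print(bout_outcome_factor(season, worst=0.0, best=100.0))
```

With this curve, the gap between a 50-point and a 100-point win is smaller than the gap between a tie and a 50-point win, which is the diminishing-returns behaviour described under Blowout Adjustment.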
Participation (nPar)

The total number of bouts a team plays is an important factor because it shows how much the team has put itself on the line to achieve its record. A team that plays only a few games can have a near-perfect record, but clearly, a team that plays twice the number of games with the same record is a better team. The normalized participation (nPar) ranks a team's participation in relation to all other teams' participation. A team with an nPar of 0.0 played the lowest number of bouts of all teams, a team with an nPar of 1.0 played the highest number of bouts, and a team with an nPar of 0.5 played the average number of bouts.

Low Participation

Teams that have played only one or two bouts are separated from the rest of the rankings and placed in the Low Participation Rankings. They remain there, ranked relative to other low participators, until they've played their third bout. Note that their rating is not penalized for low participation, and it is easy to see where they would be placed in the real rankings if they were not separated. This comparison, however, is not very meaningful, as one or two bouts do not give a solid indication of skill. See also the FAQ on low participation.

Tournament Bonus

Teams that play at tournaments are given bonus weighting proportional to their performance at the tournament. There are several reasons for factoring in tournament play: many bouts are played in a short period of time, the pressure of tournament play is higher, and the derby world acknowledges the importance of tournament wins. All of these indicate that doing well in a tournament should be worth more than doing well in regular bouts. Tournaments are assigned an importance factor (national tournaments are worth double that of regional tournaments). First-place winners are assigned the full point value of a tournament (tBonus), while runners-up receive a portion of tBonus determined by a falloff curve.

Output Data - Final Rating

The final rating of a team is the combined factor of its BOF, nPar, and tBonus.
The BOF remains the most influential factor, the nPar offsets errors for teams that don't play many bouts (but do well), and tBonus factors in the relevance of teams that win (and do well) in tournaments. The algorithm produces a rating for every participating team. Sorting this result, highest first, is what determines our rankings.

It is important to point out the differences between a rating and a ranking. A rating system represents a continuum of possibilities, with no guarantees about the density of teams plotted on this line. In this algorithm, the rating is a measure of performance. A greater difference between two teams' ratings should reflect a greater difference in their skill. In contrast, a ranking scale merely shows the order of performance from best to worst. It does not show, for instance, how much worse the 4th-place team is than the 3rd-place team. Nor is this difference necessarily representative of the difference between any other two rankings.

We have decided to work with a rating system because, as a more fundamental metric, it allows for more nuanced and meaningful results. It is trivial to convert ratings to rankings, which is the primary method of displaying the results on Flat Track Stats.
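The conversion from ratings to rankings, together with the low-participation split described earlier, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TeamRating`, `rank_by_rating`, and `build_rankings` are hypothetical names, the team names and ratings are made up, and ties are left in input order because the text does not say how they are broken. The three-bout threshold comes from the Low Participation section.

```python
from dataclasses import dataclass

# From the text: teams stay in the Low Participation Rankings
# until they have played their third bout.
MIN_BOUTS_FOR_MAIN_RANKINGS = 3


@dataclass
class TeamRating:
    name: str
    rating: float       # final rating, assumed to be already computed
    bouts_played: int


def rank_by_rating(teams: list[TeamRating]) -> list[tuple[int, TeamRating]]:
    """Sort by rating, highest first, and attach 1-based ranks.

    Assumption: equal ratings keep their input order (sorted() is stable).
    """
    ordered = sorted(teams, key=lambda team: team.rating, reverse=True)
    return list(enumerate(ordered, start=1))


def build_rankings(teams: list[TeamRating]):
    """Return (main_rankings, low_participation_rankings)."""
    main = [t for t in teams if t.bouts_played >= MIN_BOUTS_FOR_MAIN_RANKINGS]
    low = [t for t in teams if t.bouts_played < MIN_BOUTS_FOR_MAIN_RANKINGS]
    return rank_by_rating(main), rank_by_rating(low)


if __name__ == "__main__":
    # Hypothetical teams and ratings, for illustration only.
    teams = [
        TeamRating("Team A", 0.62, 9),
        TeamRating("Team B", 0.71, 2),
        TeamRating("Team C", 0.55, 5),
        TeamRating("Team D", 0.80, 12),
    ]
    main, low = build_rankings(teams)
    for rank, team in main:
        print(rank, team.name, team.rating)
    for rank, team in low:
        print("low participation:", rank, team.name, team.rating)
```

Note that the ranks discard the spacing between ratings: the gap between 1st and 2nd place may be far larger than the gap between 2nd and 3rd, which is why the rating is treated as the more fundamental number.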