Ranking Multiple Blindfolded Rubik’s Cube Solves

A data-driven proposal for a new system for ranking attempts at solving multiple Rubik’s cubes blindfolded.

Brendan Gray
32 min read · Feb 4, 2024

tl;dr — Explore the alternative rankings here.

Solving Rubik’s cubes as fast as possible is a fairly niche hobby. Of those who enjoy speed solving, some solve the cube blindfolded — they start a timer, memorise a scrambled cube, don a blindfold, and then solve the cube without looking at it. Then there are a handful who take it to yet another level: multiple blindfolded solving. Here, competitors spend up to an hour committing several scrambled cubes to memory, and then solving them all consecutively without lifting their blindfold.

I want to take a data-driven look into the existing ranking system using nearly 20 thousand multiple blindfolded results from over 17 years of competitions, and try to answer the following questions:

  1. What does the data actually look like?
  2. How fair is the current ranking system?
  3. Can we come up with something better?
A depiction of a Rubik’s Cube multiple blindfolded attempt. Image by ChatGPT.

The World Cube Association (WCA) oversees speed cubing competitions worldwide and curates the results. The current world record for multiple blindfolded solving is by Graham Siggins, who solved 62 out of the 65 cubes he attempted in a competition in June 2022.

Graham Siggins solving 62 out of 65 cubes in just under 58 minutes — the current world record.

Unlike other speed-solving events where all that matters is the time, there are several dimensions to a multiple-blindfolded result — the number of cubes solved, the accuracy, and the time taken — and it is not immediately obvious to someone outside of the community how to balance these factors to rank results.

The Current Ranking System

Multiple blindfolded solves are ranked using a points system introduced in April 2008. You take the number of cubes solved successfully, and subtract the number of cubes still unsolved at the end of the attempt. Ties in points are broken first using the time (faster time is better), and then by the number of cubes missed (fewer cubes missed is better).

Using this system, the world record of 62/65 is 59 points (62 solved minus 3 missed), which beat Graham’s previous world record of 59/60 (58 points) in 59:46.

In order to be considered a successful result, at least half of the attempted cubes must be solved, and at least 2 cubes must be solved.
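
To make that concrete, here is a minimal sketch of the ranking logic in Python (my own illustration, not WCA code):

from typing import Tuple

def wca_points(solved: int, attempted: int) -> int:
    # Points = cubes solved minus cubes left unsolved.
    return solved - (attempted - solved)

def is_success(solved: int, attempted: int) -> bool:
    # At least half of the attempted cubes, and at least 2 cubes, must be solved.
    return solved >= 2 and 2 * solved >= attempted

def wca_sort_key(solved: int, attempted: int, seconds: int) -> Tuple[int, int, int]:
    # Sort by points (higher is better), then time (lower is better),
    # then cubes missed (fewer is better).
    return (-wca_points(solved, attempted), seconds, attempted - solved)

print(wca_points(62, 65))  # the current world record: 59 points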

An Interesting History Note

Prior to the April 2008 regulations update, 100% accuracy was prioritised as the most important dimension, leading to absurd situations like that at the Toronto Open Winter 2008. Ryosuke Mondo solved an impressive 17 out of 18 cubes, but his one mistake meant that he only came in 3rd place, losing to Rowe Hessler (2/2) and Eric Limeback (3/3). Ryosuke’s result would have been a world record had it been set one month later, after the new regulations came into effect. Instead, the world record remained Dennis Strehlau’s 10 out of 10 from the Belgian Open 2008, and Ryosuke’s impressive 17/18 placed him only 49th in the world rankings, behind a host of 2/2 results.

Interestingly, when the current point system was introduced, many were opposed to the change, with some arguing that a single mistake on a single cube should result in the entire result being disqualified.

Fortunately, sanity prevailed, and the more forgiving format led to an explosion in the number of cubes people were willing to attempt. So much so, in fact, that in February 2009 the WCA imposed an overall time limit of one hour, in addition to the time limit per cube, to make the event more manageable for competition organisers.

After the last time limit change, the WCA kept all old results under the event “3x3x3 Multi-Blind Old Style”, and created a new multiple blindfolded event for new results going forward. They migrated all historical results that were valid under the new regulations (non-negative number of points, and within the hour time limit), so the new format has results going back all the way to 2007.

One other change to the event came with the January 2014 regulations. Prior to this, attempting 2 cubes and successfully solving only 1 was considered a success. Now, the competitor must solve at least 2 cubes correctly for it to be considered a successful multiple blindfolded result.

The “Flaws”

Whenever a ranking needs to combine multiple dimensions, you have to choose some way to either prioritise dimensions, or perform some weighted combination of the dimensions. This choice is going to be subjective, because people will always have different opinions about the relative importance of the dimensions.

In my subjective opinion, while the current points system works well at balancing the number of cubes solved relative to the attempt size, it places very little weight on time, and punishes poor accuracy perhaps slightly too harshly.

As an exercise, have a look at the following ten results, and try to find a way to order them according to your perception of each competitor’s skill.

 8/ 9 in 52:37
31/55 in 53:34
9/11 in 54:37
14/21 in 55:05
7/ 7 in 56:59
6/ 6 in 53:15
15/24 in 54:10
7/ 8 in 55:57
29/52 in 56:16
6/ 6 in 57:44

If you managed to notice that all ten results are already ordered according to the WCA’s current ranking system, then well done! If not, compare your ordering to the order here.

Now reflect: is this ordering fair? Does 29/52 really reflect the same ability as 7/8 in the same time? Is the speed and memory capacity required to fit 52 cubes into an hour, with poor accuracy, exactly equivalent to the much slower but far more accurate 7/8?

Is it fair that the 24- and 52-cube attempts can fall in between two 6-cube attempts, and even tie with them on points, yet it is impossible for an even-sized attempt to tie on points with an odd-sized attempt?

You might also be thinking, surely the person attempting 52 cubes with 55% accuracy has the ability to be much more accurate if they spent a little more time per cube and had fewer cubes to remember? Maybe. But does that mean they’re better? Are we measuring potential? Or are we interested in performance on the day?

The problem is that the answers to these questions depend on whether you consider speed, attempt size, or accuracy to be a better indicator of the competitor’s overall performance, and that will be a subjective opinion.

To answer which of these factors (if any) is more important, we should look at the data.

The Data

The WCA provides a results export that can be imported into a MySQL database. I’m using the results as of 11 December 2023.

We’re interested in the multiple blindfolded results in the Results table, and we’re interested in the dates of the competitions at which they happened, which we can get from the Competitions table.

The Results table has more than 4 million rows, and each row can contain up to three attempts. Queries against it tend to be a little slow, but we can shrink this down to a much more manageable data set by extracting only the multiple blindfolded results upfront.

CREATE TABLE MultiResultsExtract AS
SELECT *
FROM Results
WHERE eventId='333mbf'
;

Each row can contain up to three attempts in different columns, so we have to do a few unions to make sure we get all results in a single column. We’re also only interested in successful attempts, so we only consider results greater than zero.

CREATE TABLE RawMultiSolves AS
SELECT personId, competitionId, value1 AS result
FROM MultiResultsExtract
WHERE value1 > 0
UNION
SELECT personId, competitionId, value2 AS result
FROM MultiResultsExtract
WHERE value2 > 0
UNION
SELECT personId, competitionId, value3 AS result
FROM MultiResultsExtract
WHERE value3 > 0
;

We’re going to want to know the approximate date on which results were achieved. This will let us figure out the order in which a person attained their results, and do things like, for example, find their largest previous attempt before the current result. Competitions can span multiple days, and we have no easy way of knowing which day each result was set. So we’ll just take the start day of the competition as a best guess.

The Competitions table stores the year, month and day in separate columns. We’ll format the date as an integer in yyyymmdd format. This is an absolutely terrible format for storing a date, and makes any datetime operations a pain, but it’s a convenient format for quick and dirty analysis where all you need is to view and compare dates.

CREATE TABLE CompDates AS
SELECT
id AS competitionId,
year*1e4 + month*1e2 + day AS date
FROM Competitions
;

We can then create the main table we will be working with. We want to extract the number of cubes attempted and successfully solved from the encoded result. We want to number the attempts to make it a little easier to deal with historical attempts in the future.

CREATE TABLE MultiSolves AS
SELECT
personId,
RawMultiSolves.competitionId,
date,
result,
(99 - (FLOOR(result / 1e7) % 100)) + (result % 100) as solved,
(99 - (FLOOR(result / 1e7) % 100)) + 2 * (result % 100) as attempted,
FLOOR(result / 100) % 1e5 as seconds,
row_number() over (partition by personId order by date) as attempt_num
FROM
RawMultiSolves
LEFT JOIN CompDates on RawMultiSolves.competitionId = CompDates.competitionId
;

Finally, we want to add some extra columns to indicate accuracy and speed (in the form of time per cube).

ALTER TABLE MultiSolves
ADD COLUMN timePerCube DOUBLE,
ADD COLUMN accuracy DOUBLE
;

UPDATE MultiSolves
SET
timePerCube = seconds / attempted,
accuracy = solved / attempted
;

Assumptions

We’re going to assume that there is some single underlying variable that is proportional to some sort of multiple blindfolded skill. This assumption is necessary if we hope to be able to rank solves unambiguously. We may not be able to measure this value directly, but we can assume that it impacts other variables in some way.

Measurable outcomes that we will assume to be indirectly affected by this supposed underlying skill are:

  • Accuracy: the fraction of cubes attempted that were solved successfully. An interesting quirk about accuracy is that while it is a continuous variable in theory, it becomes discrete when considered for a particular number of attempted cubes, and this introduces challenges when directly comparing accuracy across attempts of different numbers of cubes.
  • Speed: which we will measure by looking at the time spent per attempted cube. It is also possible to consider the time spent per solved cube, as one could imagine a competitor discarding more difficult scrambles and spending time on the easier scrambles only. If we measure the time per attempted cube, we overestimate the speed of people using this strategy. However, while discarding difficult scrambles is often done by beginners aiming for just a successful result, it is not a good strategy when the goal is to maximise points, and is rarely used by more experienced solvers. Most solvers spend effort on all cubes, so considering the time per attempted cube is the most reasonable reflection of speed. Another consideration is that this overestimates the speed of people who exceed the time limit. So we need to bear in mind that the time per attempted cube is only the upper limit of an estimate of a person’s speed. Yet another significant complication of speed is that it is not independent of attempt size. A person doing a larger attempt will typically need more reviews. But for the purposes of this exercise, we will ignore this effect, and assume that a small increase in the number of cubes that a person attempts will not have a significant impact on the speed per cube.
  • Memory capacity: which we will measure by looking at the number of cubes attempted. The time limit is a considerable complication here, because memory capacity is a skill that is surprisingly easy to improve with minimal practice — considerably more so than speed. In general, we will see that competitors are constrained by the number of cubes they can practically manage within the hour time limit, and it is unusual for a person to achieve a speed such that the constraining factor is memory capacity rather than time.

We will revisit these variables later, but they will form the starting point when looking at the data.

We do need to be mindful of not falling into the trap of interpreting artificial structures that are introduced by complications like the time limit, or the discrete nature of accuracy. We can acknowledge these structures when they arise, but should not use them to draw conclusions regarding competitors’ skill.

Initial Data Exploration

We’ll start by looking at how our variables are distributed. We’ll do this using Python, Pandas, and Seaborn. We can import the libraries we need, and fetch our data as follows:

from typing import Tuple, Optional
import os
from matplotlib.axes import Axes
from mysql import connector as mysql
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def fetch_data() -> pd.DataFrame:
    conn = mysql.connect(
        database='wca_public',
        user=os.environ['MYSQL_USER'],
        password=os.environ['MYSQL_PASSWORD'],
    )

    query = "SELECT * FROM MultiSolves;"

    data = pd.read_sql(query, conn)
    data['accuracy'] = data['accuracy'] * 100
    return data

Attempt Size

A lot of cubes in a pile. Image by Midjourney.

We start by looking at the distribution of attempt sizes.

def cube_count_distributions(data: pd.DataFrame) -> None:
    fig, axes = plt.subplots(1, 2, sharey=True)
    sns.histplot(
        data=data,
        x='solved',
        ax=axes[0],
        bins=range(0, 70, 1),
    )
    sns.histplot(
        data=data,
        x='attempted',
        ax=axes[1],
        bins=range(0, 70, 1),
    )
    plt.show()
Distributions of the number of cubes solved (left) and attempted (right).

The number of cubes attempted follows an exponential distribution, with the most common attempt sizes being 2, 3, and 4. Up to 10 cubes, even numbers of cubes are more common. This is likely because an even number of cubes is slightly more forgiving: exactly 50% accuracy is enough to get a success with an even number of cubes, whereas an odd number of cubes requires exceeding 50%.

Above 10 cubes, a common strategy is to memorise in packs of 4 (for smaller attempts) or 8 (for larger attempts). Those memorising in packs will often keep a single cube separate, commit it to very short term memory at the very end of the memorisation phase, and execute it first. This is the reason for the spikes at 13, 17, 25, 33 and, to some extent, 37 cubes. Beyond this point, memorisation strategies become far more personal, and the number of cubes a person attempts is influenced more by a target number of points needed for a record than by memorisation strategy. This is why we see a spike at 42 cubes (the world record stood at 41 points for almost 5 years, making 42 a goal for many). There’s also a jump to 60 amongst the absolute world class solvers due to the appeal of the one cube per minute barrier, and the current world record of 59 points.

Considering the distribution of solved cubes, we see a number of results with only one cube solved from prior to the 2014 regulation change. Two cubes solved is the most frequent, and necessarily higher than the count of 2 cubes attempted, as it is also possible to solve only 2 cubes when attempting 3 or 4 cubes.

The distribution of solved cubes is much smoother, because the number of cubes attempted is a deliberate choice by the competitor, whereas they have less control over the number of cubes solved.

Accuracy

Midjourney’s interpretation of a Rubik’s Cube and accuracy.

We can use a similar approach to look at how accuracy is distributed.

def accuracy_distribution(data: pd.DataFrame) -> None:
    sns.histplot(
        data=data,
        x='accuracy',
        bins=range(50, 101, 2),
    )
    plt.show()
Distribution of accuracy.

Spikes in the distribution of accuracy at 50%, 60%, 66%, 75%, 80%, and 100% are a result of the divisibility of the number of cubes attempted, and not a reflection of people’s ability. This makes it difficult to see the actual underlying distribution.

However, we can do something a little unorthodox to see past these spikes. We can assume that accuracy is drawn from a continuous distribution, but the value “snaps” to the nearest available value in the result. We can simulate this by adding a uniform “fuzziness” to the accuracy measurements across the range of uncertainty. For example, if someone solves 3 out of 4 cubes, instead of assigning all of these an accuracy of 75% exactly, we assign an accuracy chosen uniformly from the range [62.5, 87.5). We also exclude attempts of 2 cubes, as these always have exactly 100% accuracy under the current regulations.

from numpy.random import random_sample

def add_fuzziness(data: pd.DataFrame) -> pd.DataFrame:
    fuzzy_data = data.copy()

    fuzziness = (-0.5 + random_sample(len(fuzzy_data))) * (100 / fuzzy_data['attempted'])
    fuzzy_data['fuzzy_accuracy'] = fuzzy_data['accuracy'] + fuzziness

    out_of_range = (fuzzy_data['fuzzy_accuracy'] > 100) | (fuzzy_data['fuzzy_accuracy'] < 50)
    fuzzy_data.loc[out_of_range, 'fuzzy_accuracy'] = (
        fuzzy_data.loc[out_of_range, 'fuzzy_accuracy'] - 2 * fuzziness[out_of_range]
    )

    return fuzzy_data
Distribution of accuracy with fuzziness applied to smear measurement uncertainty.

This distribution is far smoother and surprisingly flat. One could argue that there appears to be a slight peak between 75% and 85%.

Let’s see if there is a relationship between accuracy and attempt size.

def filter_by_attempt_size(data: pd.DataFrame, size_range: Tuple[int, int]) -> pd.DataFrame:
    return data[(data['attempted'] >= size_range[0]) & (data['attempted'] <= size_range[1])]


def fuzzy_accuracy_distribution(data: pd.DataFrame, ax: Optional[Axes] = None) -> None:
    sns.histplot(
        data=data,
        x='fuzzy_accuracy',
        bins=range(50, 101, 2),
        ax=ax,
    )


def accuracy_distribution_by_attempt_size(data: pd.DataFrame) -> None:
    fig, axes = plt.subplots(2, 2)
    plt.subplots_adjust(hspace=0.7, wspace=0.4)

    fuzz_data = add_fuzziness(data)

    fuzzy_accuracy_distribution(filter_by_attempt_size(fuzz_data, (3, 10)), axes[0][0])
    fuzzy_accuracy_distribution(filter_by_attempt_size(fuzz_data, (11, 20)), axes[0][1])
    fuzzy_accuracy_distribution(filter_by_attempt_size(fuzz_data, (21, 30)), axes[1][0])
    fuzzy_accuracy_distribution(filter_by_attempt_size(fuzz_data, (31, 66)), axes[1][1])

    axes[0][0].set_title('3-10 cubes')
    axes[0][1].set_title('11-20 cubes')
    axes[1][0].set_title('21-30 cubes')
    axes[1][1].set_title('31-66 cubes')

    plt.show()
Fuzzy accuracy distributions for various attempt size ranges.

Interestingly, we see that those attempting smaller numbers of cubes are the main contributors to the flat distribution we saw earlier. There is a peak that becomes more prominent as the attempts grow larger, with the 21–30 and 31+ ranges showing very clear unimodal distributions.

It is also interesting that the peak for 21–30 cubes is around 80%, whereas the peak for 31+ cubes is a little higher, around 85%. This supports the idea that there is an underlying skill. A competitor that has the speed and capacity for larger attempts is also going to have better accuracy. We will revisit this hypothesis a little later.

Speed

A cheetah running past cubes. Image by Midjourney.

Finally, we consider speed.

def speed_distribution(data: pd.DataFrame) -> None:
    sns.histplot(
        data=data,
        x='timePerCube',
        bins=range(0, 601, 10),
    )
    plt.show()
Distribution of speed (seconds taken per cube attempted)

The first thing that stands out is the 10 minute time limit per cube clearly visible on the far right. Other time limiting factors are less obvious, but become clear if we plot specific attempt sizes separately.

def speed_distribution_subplot(data: pd.DataFrame, ax: Optional[Axes] = None) -> None:
    sns.histplot(
        data=data,
        x='timePerCube',
        bins=range(0, 601, 10),
        ax=ax,
    )


def speed_distribution_by_attempt_size(data: pd.DataFrame) -> None:
    fig, axes = plt.subplots(2, 2)
    plt.subplots_adjust(hspace=0.7, wspace=0.4)

    speed_distribution_subplot(filter_by_attempt_size(data, (13, 13)), axes[0][0])
    speed_distribution_subplot(filter_by_attempt_size(data, (17, 17)), axes[0][1])
    speed_distribution_subplot(filter_by_attempt_size(data, (25, 25)), axes[1][0])
    speed_distribution_subplot(filter_by_attempt_size(data, (33, 33)), axes[1][1])
    axes[0][0].set_title('13 cubes')
    axes[0][1].set_title('17 cubes')
    axes[1][0].set_title('25 cubes')
    axes[1][1].set_title('33 cubes')

    plt.show()
Distribution of speed for attempts of 13, 17, 25 and 33 cubes.

The speed distribution becomes very narrow and bunches up right against the time limit. It appears that people are limited by speed, not by memory capacity, and will do as many cubes as they can fit into an hour. This effect becomes more pronounced as the number of cubes increases.

Another way to look at speed is to consider not the time per cube, but the inverse — the number of cubes that could potentially be solved in an hour at that speed, assuming perfect accuracy and, of course, that the same speed could be maintained for a larger attempt (which is certainly not true, but still interesting to consider).


def cubes_per_hour_distribution(data: pd.DataFrame) -> None:
    data['could_solve_per_hour'] = 3600 / data.timePerCube
    sns.histplot(
        data=data,
        x='could_solve_per_hour',
        bins=range(0, 201, 5),
    )
    plt.show()
Distribution of the number of cubes that could be solved in an hour, assuming constant speed.

You can’t really make out the tail on that, so let’s zoom in.

Tail end of the distribution of the number of cubes that could be solved in an hour.

While no-one has ever attempted more than 66 cubes in an official result, there are many attempts of 2 to 5 cubes with a much faster pace. At the extreme, Stanley Chapel managed to achieve 20.5 seconds per cube on a 2 cube attempt, a pace that would allow him to solve 175 cubes in the hour if it were possible to maintain. Yet his fastest pace per cube on a large attempt has been 64 seconds per cube, which allows him a maximum of 55 cubes in the hour.

We will revisit the idea of people choosing to attempt much smaller numbers of cubes than they are capable of in the next section.

Going Multivariate

Let’s combine the three dimensions of speed, accuracy and attempt size. We’ll start with a pair plot to get an overview of the data.

def pair_plot(data: pd.DataFrame) -> None:
sns.pairplot(
data=data,
vars=['solved', 'attempted', 'accuracy', 'timePerCube'],
hue='accuracy',
palette='viridis_r',
)
plt.show()

We can see some interesting structure when looking at speed for the number of cubes solved. Let’s have a closer look.

def speed_vs_solved_by_accuracy_scatter(data: pd.DataFrame) -> None:
sns.scatterplot(
x='solved',
y='timePerCube',
hue='accuracy',
data=data,
palette='viridis_r',
)
plt.xlim(0, 70)
plt.ylim(0, 600)

plt.show()
Scatter plot of time taken per cube against number of cubes solved, coloured by accuracy.

Hopefully you also notice that interesting “A” shaped structure. The right leg consists of those who are constrained by the time limit. This is even something we could derive an analytical expression for:

t = min(3600a / n, 600 / a)

where t is the time per attempted cube in seconds, n is the number of cubes solved, and a is the accuracy expressed as a fraction. We could even plot this to confirm it matches the right leg of the plot above.

def predicted_speed_vs_solved_by_accuracy_scatter(data: pd.DataFrame) -> None:
    t = lambda n, a: min(3600 * (a / 100) / n, 600 / (a / 100))

    approx = pd.DataFrame([{
        'solved': n,
        'accuracy': a,
        'timePerCube': t(n, a),
    } for n in range(2, 66) for a in range(50, 101, 5)])

    sns.scatterplot(
        x='solved',
        y='timePerCube',
        hue='accuracy',
        data=approx,
        palette='viridis_r',
    )
    plt.xlim(0, 70)
    plt.ylim(0, 600)

    plt.show()
Predicted time per cube against number of cubes solved, coloured by accuracy, assuming that the time limit is reached.

The left leg is somewhat more interesting. These are people who were not constrained by the time limit, and are attempting fewer cubes than their speed would allow them to. Let’s look specifically at those who used less than 75% of the time available to them.

def speed_vs_solved_by_accuracy_scatter_without_time_limit(data: pd.DataFrame) -> None:
    sns.scatterplot(
        x='solved',
        y='timePerCube',
        hue='accuracy',
        data=data[(data['seconds'] < 2700) & (data['timePerCube'] < 450)],
        palette='viridis_r',
    )
    plt.xlim(0, 70)
    plt.ylim(0, 600)

    plt.show()
Time taken per cube against number of cubes solved, coloured by accuracy, for those that used less than 75% of the time available.

It’s a mix. Considering each result in isolation, we have no easy way of knowing whether these people are constrained by memory, or just choosing to do a much smaller attempt than they are capable of. We would probably need to look not only at attempts in isolation, but also at a person’s historical attempts to get an idea of their ability.

It’s a good idea to understand how large this population of solvers not making full use of the time limit is.

def population_size_below_time_limit(data: pd.DataFrame, attempt_size: int) -> None:
    time_threshold = 0.75
    max_time = 3600 * time_threshold
    max_time_per_cube = 600 * time_threshold

    data_subset = data[(data['attempted'] >= attempt_size)]

    count = len(data_subset)
    count_below_time_limit = len(data_subset[
        (data_subset['seconds'] < max_time) &
        (data_subset['timePerCube'] < max_time_per_cube)
    ])

    print(f'Number of attempts of at least {attempt_size} cubes: {count}')
    print(f'Number of attempts of at least {attempt_size} cubes and well below time limit: {count_below_time_limit}')
    print(f'Percentage using at least 75% of the available time: {100 - count_below_time_limit / count * 100:.1f}%')

When we consider people doing at least 13 cubes, 97.2% are using at least 75% of the time available to them, compared to only 53.2% of those doing 12 cubes or fewer, or 37% doing 6 cubes or fewer. When we consider only attempts of at least 33 cubes, every one of the 951 attempts used more than 75% of the time limit.
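
For reference, the figures for larger attempts come from calls like the ones below; the “12 cubes or fewer” and “6 cubes or fewer” figures need the complementary filter (attempted at most N), which is a trivial variation on the same function.

# Usage sketch for the ">= N cubes" figures quoted above.
population_size_below_time_limit(data, 13)
population_size_below_time_limit(data, 33)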

This suggests again that the time limit is the major constraining factor at the top level, and never memory capacity.

But let’s return to those smaller attempts for a moment. We do need to consider those who are capable of very large attempts, but choose to do much smaller attempts instead. We’ve already mentioned Stanley Chapel, who solved 2/2 in 41 seconds (the fastest pace per cube), but this followed a string of 48 and 49 cube attempts that used the full hour. Attempts far smaller than a person’s maximum ability are not unusual. In fact, the majority of those attempting 6 cubes or fewer have the speed for far more cubes. But how common is it for people to attempt far fewer cubes than they are capable of? We will use a person’s largest prior attempt as a proxy for their maximum ability.

def add_largest_prev_attempt_column(data: pd.DataFrame) -> pd.DataFrame:
    data.sort_values(by='attempt_num', inplace=True)
    data.reset_index(inplace=True)
    data['largest_prev_attempt'] = data.groupby('personId')['attempted'].transform(
        lambda x: x.shift(1).cummax()
    )
    return data


def attempt_size_vs_largest_prev_attempt_scatter(data: pd.DataFrame) -> None:
    data = add_largest_prev_attempt_column(data)

    sns.scatterplot(
        x='largest_prev_attempt',
        y='attempted',
        hue='timePerCube',
        data=data,
        palette='viridis_r',
    )
    plt.show()
Number of cubes attempted vs largest previous attempt

We see a bunching up of points along the diagonal, and a distinctive gap in the lower right triangle with only a few scattered results. This suggests that people tend to push themselves towards larger attempts (although perhaps this is only clearly the case for those who have previously attempted 20 or more cubes).

Finally, let’s take one last look at our earlier finding that speed and accuracy might be correlated. We plot a bivariate histogram of the time per cube and accuracy.

def speed_vs_accuracy_bivariate_histogram(data: pd.DataFrame) -> None:
    sns.histplot(
        data=add_fuzziness(data),
        x='timePerCube',
        y='fuzzy_accuracy',
    )
    plt.show()
Bivariate histogram of the time per cube and accuracy

It’s hard to see, but if we look at the darker region and squint a bit, it seems to form a diagonal blob going from the top left to the bottom right. This suggests a trend supporting our earlier finding that speed and accuracy are correlated. Let’s see whether this is statistically significant. The data is not normally distributed and the relationship is unlikely to be linear, so the Pearson correlation coefficient is not appropriate here. We should rather use the Spearman rank correlation coefficient.

from scipy import stats

spearman_corr, p_value_spearman = stats.spearmanr(data['accuracy'], data['timePerCube'])
print(f"Spearman correlation coefficient: {spearman_corr}, p-value: {p_value_spearman}")

We get a Spearman correlation coefficient of about -0.13, with an extremely tiny p-value on the order of 10⁻⁷⁵. The relationship we observed is weak, but it is definitely there and highly statistically significant. There is practically no way we would see this correlation by chance.

Recap and Revisiting Our Assumptions

We initially assumed that there was one underlying skill factor that drove our three variables of accuracy, speed, and memory capacity. What we have found is that:

  • The one hour time limit dominates over memory capacity — those who solve fast tend to also be able to solve lots of cubes. However, there are some who do occasionally choose to do much smaller attempts than they are capable of.
  • Speed and accuracy are indeed correlated — faster solvers also tend to be more accurate on average. However, there is a lot of variance in accuracy, and overall, the distribution of accuracy is relatively flat, particularly for smaller attempts.

So far, we have not found anything that would suggest that this single underlying skill variable does not exist.

Let’s try to come up with a few criteria for a good ranking system using what we have learned.

A Better Ranking System

Ultimately, the aim of multiple blindfolded solving is to solve as many cubes as possible while blindfolded. So the core of any ranking system should be built around the number of cubes successfully solved.

We also want to reward solvers for accuracy and speed. Accuracy is already a fraction between 0 and 1 (actually between 0.5 and 1 in practice, but close enough). We can define a speed-related factor as the fraction of the available time used. Usually this will be a fraction between 0 and 1, but occasionally it may exceed 1 if the competitor reaches the time limit and any 2-second penalties are applied. But these will be small.

Let’s take a first stab at a formula:

score = solved × accuracy / (fraction of time used)

Let’s have a look at the top 5 using this scoring system.

+------+------------+----------------+-------+--------------------------+------------+
| Rank | WCA ID     | Result         | Score | Competition              | Date       |
+------+------------+----------------+-------+--------------------------+------------+
|    1 | 2016SIGG01 | 62/65 in 57:47 | 61.41 | BlindIsBackLA2022        | 2022-06-26 |
|    2 | 2016CHAP04 | 2/2 in 0:41    | 58.54 | BuckeyeBigBrain2023      | 2023-01-14 |
|    3 | 2021OTAI01 | 5/5 in 4:18    | 58.14 | LetsQualifyKuwait2023    | 2023-05-19 |
|    4 | 2019EGGI02 | 2/2 in 0:45    | 53.33 | WC2023                   | 2023-08-12 |
|    5 | 2007HESS01 | 52/56 in 56:44 | 51.07 | EmpireStateFallFocus2023 | 2023-10-15 |
+------+------------+----------------+-------+--------------------------+------------+

The world record remains unchanged, but the rest of the rankings change drastically. It seems a bit absurd that fast 2/2 attempts are enough to make the world’s top 5. Any ranking system will be subjective, but one consideration is that we do want to encourage larger attempts. This scoring system would definitely encourage competitors to favour small fast attempts in much the same way that the pre-2008 system encouraged small accurate attempts. We don’t really want that.

We can adjust the impact of the accuracy and time factors by introducing exponents into our equation:

score = solved × accuracy^a / (fraction of time used)^b

Now we have to choose appropriate values for a and b. This could be a classic optimisation problem, but the fact that our ranking system is inherently subjective means that there is no correct cost function that we could use. Instead, we need to choose some objective to optimise for. But how?

Let’s take a step back and look at the point system that the WCA currently uses. While it does leave a bit to be desired, it does a fairly good job at encouraging large and accurate attempts without putting much emphasis on raw speed. We’re also not looking to do a drastic shake-up here. We just want to make some small improvements to the existing ranking system.

So, let’s arbitrarily choose as our goal minimising the number of drastic shake-ups in the rankings. We can do this by ranking all results using the WCA’s point system, and then ranking them again using our score with some candidate values for a and b. We can sum the squared differences in the rankings of all results, and call that the objective function that we want to minimise.

What are the bounds of a and b? Clearly we want to reward solving faster and punish poor accuracy. Negative exponents would achieve the opposite, so we should set the lower bound of both exponents to 0.

Should the exponents be less than or greater than 1? If the exponent is less than 1, the value is pushed higher, and the gradient becomes gentler closer to 100%. If the exponent is greater than 1, the value drops lower, and the gradient is steepest at 100%.

Effect of exponents less than or greater than one.

For the fraction of time used, we have already determined that our first attempt at a formula (equivalent to both exponents equal to 1) put too much emphasis on raw speed. So we want to reduce the impact of using a very small fraction of the available time, which means that the time exponent b should be less than 1.

Accuracy on the other hand, is a little more nuanced. On the one hand, it could be argued that we might want to reward higher accuracy and punish even slight mistakes. On the other hand, we’ve seen from history that allowing some room for error encourages larger attempts. Any choice will be subjective, so we will err on the side of allowing our optimisation to be forgiving to mistakes if it needs to be, and we would prefer that it not punish mistakes any more than our first stab did. We will therefore also limit the accuracy exponent a to be less than or equal to 1.

Because the calculation of rank differences is inherently non-linear and discontinuous, we need to use a global optimisation function that is robust to these challenges. We’ll use differential evolution, partly because it is quite well suited for this sort of problem, but partly as an arbitrary choice because there are so many optimisation techniques available with no clear winner.

import math

import numpy as np
from scipy.optimize import differential_evolution


def add_rank_column(data: pd.DataFrame, column: str, ascending: bool = True) -> pd.DataFrame:
    data.sort_values(by=column, inplace=True)
    data[f'rank_{column}'] = data[column].rank(method='dense', ascending=ascending)

    return data


def add_calculate_score(data: pd.DataFrame, accuracy_exponent: float, time_exponent: float) -> pd.DataFrame:
    data['score'] = data.solved * \
        data.accuracy.transform(lambda x: math.pow(x / 100, accuracy_exponent)) / \
        data.fraction_time_used.transform(lambda x: math.pow(x, time_exponent))

    return data


def cost_function(data: pd.DataFrame, accuracy_exponent: float, time_exponent: float) -> float:
    data = add_calculate_score(data, accuracy_exponent, time_exponent)
    data = add_rank_column(data, 'score', ascending=False)
    sqr_error = (data.rank_score - data.rank_result) ** 2
    return np.sqrt(sqr_error.sum())


def optimize_exponents(data: pd.DataFrame) -> Tuple[float, float]:
    data = add_rank_column(data, 'result')
    # Attempts of fewer than 6 cubes have a limit of 10 minutes per cube;
    # everything else is capped at one hour.
    data['available_time'] = data.attempted.transform(lambda x: x * 600 if x < 6 else 3600)
    data['fraction_time_used'] = data.seconds / data.available_time

    bounds = [(0, 1), (0, 1)]

    result = differential_evolution(
        lambda x: cost_function(data, x[0], x[1]),
        bounds,
        maxiter=1000,
        tol=1e-4,
    )

    if not result.success:
        raise Exception('Failed to optimize: ' + result.message)

    print("Accuracy exponent:", result.x[0], "Time exponent:", result.x[1])

    return result.x

When we run this, we get the following:

Accuracy exponent: 1.0000, Time exponent: 0.4359

That time exponent is a little messy. The value of 0.4359 minimises shake-ups from the existing ranking system. To be honest, we can tolerate a little shake-up if it makes the formula a little more elegant. So, we’ll nudge the time exponent up to 0.5 to get the following formula:

score = solved × accuracy / √(fraction of time used)
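
To make the final formula concrete, here is a minimal sketch of the proposed score as a standalone Python helper (my own illustration, not part of any official tooling), using the available-time rule described earlier: 10 minutes per cube for attempts of fewer than 6 cubes, and one hour otherwise.

import math

def available_time(attempted: int) -> int:
    # Attempts of fewer than 6 cubes are limited to 10 minutes per cube;
    # larger attempts are capped at one hour.
    return attempted * 600 if attempted < 6 else 3600

def proposed_score(solved: int, attempted: int, seconds: int) -> float:
    accuracy = solved / attempted
    fraction_time_used = seconds / available_time(attempted)
    return solved * accuracy / math.sqrt(fraction_time_used)

# The current world record, 62/65 in 57:47.
print(round(proposed_score(62, 65, 57 * 60 + 47), 2))  # ~60.26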

Evaluation

Before we dive into the real rankings, let’s assess whether our system makes sense intuitively. First, let’s go back to the 10 example results we introduced near the beginning of the article.

+----------------+-----------------+-----------------+
| Result         | WCA system rank | New system rank |
+----------------+-----------------+-----------------+
| 31/55 in 53:34 |               2 |               1 |
| 29/52 in 56:16 |               9 |               2 |
| 15/24 in 54:10 |               7 |               3 |
| 14/21 in 55:05 |               4 |               4 |
| 9/11 in 54:37  |               3 |               5 |
| 8/9 in 52:37   |               1 |               6 |
| 7/7 in 56:59   |               5 |               7 |
| 6/6 in 53:15   |               6 |               8 |
| 7/8 in 55:57   |               8 |               9 |
| 6/6 in 57:44   |              10 |              10 |
+----------------+-----------------+-----------------+

These results were deliberately chosen to appear “wrong” in some sense when ranked using the WCA system. Using our proposed scoring system, the order makes much more sense, and is much closer to how one might intuitively rank the results based on how difficult they are perceived to be.

While the WCA point system placed all of the odd-sized attempts above the even-sized attempts, the proposed system gives a good mix of attempt sizes.

We see some considerable jumps though. The second worst by the WCA point system jumps upward to second place, and the best in the WCA point system drops to sixth place. Let’s look at how these jumps occur in the real rankings.

The Biggest Movements

We’ll start by looking at the upward movements. The worst results in the WCA database are those with 0 points using the full hour. Graham Siggins is usually known for having the top ranked result in the WCA database, but he also has the bottom ranked result, 32/64 (0 points), using the full hour, with a +2 second penalty due to a cube being one move away from solved. He also has another 32/64 without the penalty in the full hour.

Using the new scoring system, 32 cubes solved in the hour with 50% accuracy is equivalent to 16 cubes solved with 100% accuracy, so these results jump up considerably — from the very bottom of the rankings into the top 12% of results. Other upward jumps are Kamil Przybylski’s 24/48, Mark Boyanowski’s 21/42, Maxime Madrzyk’s 19/38, and Rowe Hessler’s two 31/60 attempts.
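
A quick check with the hypothetical proposed_score helper from earlier confirms that equivalence:

print(proposed_score(32, 64, 3600))  # 16.0
print(proposed_score(16, 16, 3600))  # 16.0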

These large attempts with near 50% accuracy make up the majority of the big upward movers. The largest upward jump on a small attempt is Stanley Chapel’s 2/2 in 41 seconds, which jumps from 2 points to a score of 10.82 in this proposed scoring system, jumping from the 65th percentile to the 22nd percentile. Perhaps not ideal, but it’s better than jumping all the way into second place.

The biggest downward drops are those 2/2 solves using close to the full 20 minutes available. Generally, these beat large but inaccurate attempts using the WCA scoring system because, as we have shown, people doing larger attempts tend to aim to use as much of the time available as they can. This means that the majority of large attempts that result in 2 points or less will use much more than 20 minutes, meaning that 2/2 results tend to be ranked higher than larger 2 point attempts in the WCA point system. However, as the proposed system rewards larger attempts, the 2/2 results that are not fast enough to benefit from speed tend to lose out in the new rankings.

High scoring attempts tend to move much less. Of those attempts with 10 or more points in the WCA system, the largest movement is 13.3% (Stanley Chapel’s 31/52 in 1:00:00), and for those with more than 20 points, the largest movement is just 2.7% (Graham Siggins’s 41/62 in 1:00:00).

Top 10 Rankings

Let’s look at how the top 10 rankings change. These results were correct as of 11 December 2023 when the data for this analysis was pulled.

+------+----------+--------+------------+----------------+-----------------------------+------------+
| Rank | WCA Rank | Change | WCA ID     | Result         | Competition                 | Date       |
+------+----------+--------+------------+----------------+-----------------------------+------------+
|    1 |        1 |        | 2016SIGG01 | 62/65 in 57:47 | BlindIsBackLA2022           | 2022-06-26 |
|    2 |        2 |        | 2007HESS01 | 52/56 in 56:44 | EmpireStateFallFocus2023    | 2023-10-15 |
|    3 |        3 |        | 2015CHEN49 | 51/54 in 57:49 | MayMBLDMadnessSingapore2023 | 2023-05-20 |
|    4 |        4 |        | 2011BANS02 | 48/48 in 59:48 | DelhiMonsoonOpen2018        | 2018-07-22 |
|    5 |        6 | ↑1     | 2013BOBE01 | 51/55 in 58:06 | SzansaCubingOpenWarsaw2022  | 2022-09-17 |
|    6 |        5 | ↓1     | 2016CHAP04 | 49/51 in 57:50 | OhioStateSummerSolving2023  | 2023-06-10 |
|    7 |        7 |        | 2019KOBE03 | 48/50 in 58:21 | PBQPickering2022            | 2022-10-15 |
|    8 |        9 | ↑1     | 2014BOYA01 | 47/50 in 54:18 | WC2019                      | 2019-07-11 |
|    9 |        8 | ↓1     | 2016PRZY01 | 46/46 in 59:13 | PolishChampionship2022      | 2022-07-08 |
|   10 |       10 |        | 2021TRIP01 | 46/49 in 56:52 | TasmanianOpen2022           | 2022-03-25 |
+------+----------+--------+------------+----------------+-----------------------------+------------+

We see a little movement, but not much at this level. Krzysztof Bober’s 51/55 just nudges past Stanley Chapel’s 49/51 and Mark Boyanowski’s 47/50 is fast enough to pass Kamil Przybylski’s more accurate but slower 46/46. This is good! We wanted to reward bigger, faster attempts and be a little more lenient with accuracy.

World Record History

If we look at the world record history, there are only a couple of small changes. Mark might be pleased at having climbed a place in the world rankings, but he would surely be disappointed to learn that his 2018 world record of 43/44 using the full hour was just a little too slow to beat Maskow’s perfectly accurate and incredibly fast 41 cube attempt in just 54 minutes in 2013.

On the other hand, Tim Habermaas would likely be pleased to be gaining a world record under the new format. In 2008, he set the world record of 24/24 in a little over 2 hours, but this was on the old format prior to the introduction of the hour time limit. Unfortunately, he never managed to beat the world record under the new format, despite coming close on several occasions. Under this new proposed system, his 5/6 in under 21 minutes in 2008, despite not having perfect accuracy, was more than fast enough to beat the world record at the time of 6/6 in about 48 minutes.

Dennis Strehlau would also add one more world record to his name. His 5/5 in 24 minutes beat Tim’s 5/6 in 21 minutes, giving him a world record 5 months sooner than his 8/8 in 58 minutes.

Gaming the System Part 1: Accuracy

A common strategy amongst beginners aiming to get a successful result on their profile, regardless of the points, is to submit 4 cubes, and then memorise and solve the 2 easiest cubes. Since they need to solve a minimum of 2 cubes anyway, and submitting 4 cubes gives them an extra 20 minutes and the margin to put aside more difficult scrambles, there is no downside for the competitor.

A system that is too lenient on accuracy could encourage competitors to submit cubes with no intention of attempting to solve all of them. This puts strain on organisers as it takes time to scramble and check all of the submitted cubes.

How does the proposed scoring system fare against these strategies? Firstly, because the rules for what constitutes a success haven’t changed, this strategy remains just as valid as with the WCA’s current ranking system. Secondly, we’ve already seen that large attempts with low accuracy are the biggest winners under the new system.

But let’s consider the general strategy where a competitor aims to solve N cubes, but attempts more cubes so that they can filter scrambles and choose the N easiest ones.

Let’s consider a result of N/N in a time of x seconds. How fast would a result of N/2N need to be in order to beat the N/N result? The answer, perhaps surprisingly, depends on the value of N. If N is 2 or 3, the answer is x/2. If N is 4 or 5, the answer is 0.375x or 0.3x respectively, and for any N of 6 or more, the answer settles at x/4. This comes about because of the different time limits for different attempt sizes, as the short sketch below shows. So, in order for a 2/4 attempt to beat a 2/2 attempt, the two cubes would need to be solved in half the time.
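
Here is a short sketch of where those break-even factors come from, reusing the hypothetical available_time helper from earlier: halving the accuracy must be compensated by using a quarter of the fraction of available time, and the factor relative to x depends on how the available time changes between N and 2N cubes.

def break_even_factor(n: int) -> float:
    # An N/2N attempt matches an N/N attempt in x seconds when
    # n * 0.5 / sqrt(y / T_2n) == n * 1.0 / sqrt(x / T_n), i.e. y = 0.25 * x * T_2n / T_n.
    return 0.25 * available_time(2 * n) / available_time(n)

for n in (2, 3, 4, 5, 6):
    print(n, break_even_factor(n))  # 0.5, 0.5, 0.375, 0.3, 0.25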

Out of the 610 solves with a 2/4 result in the WCA database, only two were fast enough to beat the slowest 2/2 results using the proposed scoring system. Only four 3/6 results out of 372 were fast enough to beat the slowest 3/3 result, and just one 4/8 result out of 199 was fast enough to beat the slowest 4/4, but it was by Marcin Kowalczyk, who held the world record at the time.

So while the proposed system is more forgiving of lower accuracy, it still requires considerable skill and speed to take advantage of this to improve in the rankings.

Gaming the System Part 2: Speed

If strategies that involve sacrificing accuracy for the sake of being able to filter scrambles should be discouraged because they strain organisers due to the extra cubes that need to be scrambled, what about strategies that lead to competitors attempting fewer cubes? Although we designed the system with the opposite in mind, we should see how small attempts fare.

One strategy that the new system rewards to some extent is doing smaller attempts and focusing on speed. The WCA’s point system caps an attempt of N cubes at N points. A 2/2 attempt, no matter how fast, will never be able to beat a 3/3 attempt.

We have already seen how this proposed scoring system causes Stanley Chapel’s 2/2 in 41 seconds to jump from the third to first quartile. It’s worthwhile to take this as a benchmark, and consider how fast various N/N results need to be to beat this.

It can be beaten by a 3/3 result in 2:18, or a 4/4 result in 5:28. At this point, we’re already slower than a minute per cube, which is slower than the world record pace over 65 cubes. So any world-class competitor attempting 4 cubes should be able to comfortably beat the score of Stanley’s fastest 2/2. As we continue with larger attempts, any 11/11 result, regardless of the time, would beat this result.
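
As a sketch of how such thresholds can be computed (again reusing the hypothetical helpers from earlier), the time a perfectly accurate N/N attempt needs in order to match a benchmark score follows directly from the formula:

def time_to_match(n: int, benchmark: float) -> float:
    # For an N/N attempt: score = n / sqrt(t / available_time(n)), so
    # t = available_time(n) * (n / benchmark) ** 2.
    return available_time(n) * (n / benchmark) ** 2

benchmark = proposed_score(2, 2, 41)      # Stanley Chapel's 2/2 in 41 seconds, ~10.82
print(time_to_match(3, benchmark) / 60)   # ~2.3 minutes
print(time_to_match(4, benchmark) / 60)   # ~5.5 minutes
print(time_to_match(11, benchmark) / 60)  # over an hour, so any 11/11 beats it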

But is Stanley’s 41 second 2/2 really as fast as it can get? The WCA’s current system does not encourage small attempts. Let’s imagine world-class blindfolded solvers all attempting the fastest 2/2 solves they can, and consider a hypothetical result of 2/2 in 24 seconds. This would put the average pace just a little faster than the world record for a single blindfolded solve. Just 5 cubes at a pace of a cube per minute would be enough to beat this result, and any 15/15 result would beat it. Graham’s 32/64, the worst result in the WCA’s point system, would still beat this score.

Let’s consider now N/N at a pace of 1 minute per cube. How much extra time per cube could your opponents afford to spend? If you attempt 2 cubes at 1 minute per cube, your opponent can spend up to 4 minutes per cube to get perfect accuracy in order to beat you. An opponent attempting 6 cubes can afford to spend 9 minutes per cube.

The reason that these extremely fast attempts benefit so much is that they use such a small proportion of the available time. As the proportion of available time used increases, it becomes much harder to take advantage of speed. The difference between a 57 minute and 60 minute attempt depends far more on the competitor’s mental state (focus) at the time, average scramble difficulty, and recall speed than on any deliberate choice by the competitor.

Overall, finishing the attempt in 54 minutes gives a roughly 5% advantage to the score over a similar attempt finishing in 60 minutes. At the world class level of 60 cubes, this is equivalent to solving 3 more cubes with the same accuracy. When the pace is a minute per cube, it’s more advantageous for these competitors to use the full hour and solve 6 more cubes, rather than aim to finish within 54 minutes.

To beat the world record with perfect accuracy, using a speed advantage alone, a 59/59 attempt would need to be completed in 57:30 (Siggins’s imperfect accuracy is roughly cancelled out by the 3 extra cubes he solved). A 55/55 would need to be completed in 49:58 to beat the world record, and a 42/42 would need to come in under 29:08.
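
The same hypothetical time_to_match sketch reproduces these thresholds to within a few seconds (the exact values depend on the world record time used):

wr_score = proposed_score(62, 65, 57 * 60 + 47)  # ~60.26

for n in (59, 55, 42):
    t = time_to_match(n, wr_score)
    print(f'{n}/{n} needs {int(t // 60)}:{int(t % 60):02d}')  # ~57:31, ~49:59, ~29:08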

So while it is possible to leverage speed to gain a considerable advantage at the low level, it is not feasible to take advantage of speed at the top level. There is more to be gained by adding more cubes.

Closing Thoughts

This article has proposed a new system for ranking attempts at solving multiple Rubik’s cubes while blindfolded.

This system results in scores that align better with an intuitive “feel” for what constitutes a good result, and rewards speed better than the WCA’s current system, without punishing slight lapses in accuracy.

I do not, however, suggest that the WCA change its ranking system. Despite its simplicity, the fact that the current point system uses only subtraction allows one to compare two results with a little simple mental arithmetic. While I’m relatively comfortable subtracting one and two digit numbers in my head, I, like many others, need help from a calculator for division. I can certainly say there is no way I will be calculating the square root of a fraction on the fly.

The reality is, any ranking system is going to be somewhat subjective anyway.
