
Commit 4e8ff33

Add pony factor (#4)
A rudimentary analysis of the pony factor using merged PRs as the metric. Also includes a visualization of the distribution of the number of merged PRs per unique contributor.
1 parent 4ed242e commit 4e8ff33

1 file changed

Lines changed: 128 additions & 1 deletion

site/numpy_timeseries.md

Lines changed: 128 additions & 1 deletion
@@ -23,6 +23,7 @@ import functools
2323
import datetime
2424
from dateutil.parser import isoparse
2525
import warnings
26+
from collections import defaultdict
2627
2728
import numpy as np
2829
import matplotlib.pyplot as plt
@@ -47,7 +48,7 @@ output_notebook()
4748

4849
%TODO improve handling of datetimes (super annoying)
4950

50-
A snapshot of the development on the NumPy project since {glue:text}`query_date`
51+
A snapshot of the development on the NumPy project.
5152

5253
## Issues
5354

@@ -259,12 +260,29 @@ tags: [hide-input]
259260
with open("../_data/prs.json", "r") as fh:
260261
prs = [item["node"] for item in json.loads(fh.read())]
261262
263+
### Filters
264+
262265
# Only look at PRs to the main development branch - ignore backports, gh-pages,
263266
# etc.
264267
default_branches = {"main", "master"} # Account for default branch update
265268
prs = [pr for pr in prs if pr["baseRefName"] in default_branches]
269+
270+
# Drop data where PR author is unknown (e.g. github account no longer exists)
271+
prs = [pr for pr in prs if pr["author"]] # Failed author query results in None
272+
273+
# Filter out PRs by bots
274+
bot_filter = {"dependabot-preview"}
275+
prs = [pr for pr in prs if pr["author"]["login"] not in bot_filter]
266276
```
267277

278+
The following filters are applied to the PRs before the analysis below:
279+
- Only PRs to the default development branch (e.g. ``main``)[^master_to_main]
280+
are considered.
281+
- Only PRs from users with *active* GitHub accounts are considered. For example,
282+
if a user opened a Pull Request in 2016, but then deleted their GitHub account
283+
in 2017, then this PR is excluded from the analysis.
284+
- PRs opened by dependabot are excluded.
285+
268286
### Merged PRs over time
269287

270288
A look at merged PRs over time.
@@ -392,3 +410,112 @@ p.yaxis.axis_label = "PR lifetime (hours)"
392410
p.scatter(x=num_participants, y=lifetimes.astype(int), size=9, alpha=0.4)
393411
show(p)
394412
```
413+
414+
### Where contributions come from
415+
416+
There have been a total of {glue:text}`num_merged_prs_with_known_authors`
417+
merged PRs[^only_active] submitted by {glue:text}`num_unique_authors_of_merged_prs`
418+
unique authors. {glue:text}`num_flyby` of these authors are "fly-by" contributors, i.e.
419+
users who have had exactly one PR merged (to date).
420+
421+
422+
```{code-cell} ipython3
423+
---
424+
tags: [hide-input]
425+
---
426+
427+
# Remap PRs by author
428+
contributions_by_author = defaultdict(list)
429+
for pr in merged_prs:
430+
    author = pr["author"]["login"]
431+
    contributions_by_author[author].append(pr)
432+
433+
num_merged_prs_per_author = np.array(
434+
[len(prs) for prs in contributions_by_author.values()]
435+
)
436+
437+
num_flybys = np.sum(num_merged_prs_per_author == 1)
438+
439+
glue("num_merged_prs_with_known_authors", len(merged_prs))
440+
glue("num_unique_authors_of_merged_prs", len(contributions_by_author))
441+
glue("num_flyby", percent_val(num_flybys, len(num_merged_prs_per_author)))
442+
```
443+
444+
```{code-cell} ipython3
445+
---
446+
tags: [hide-input]
447+
---
448+
449+
title = "Distribution of number of merged PRs per contributor"
450+
451+
x = ["1", "2", "3", "4", "5", "6 - 10", "11 - 20", "21 - 50", "> 50"]
452+
bedges = np.array([0, 1, 2, 3, 4, 5, 10, 20, 50, sum(num_merged_prs_per_author)]) + 0.5
453+
y, _ = np.histogram(num_merged_prs_per_author, bins=bedges)
454+
455+
p = figure(
456+
x_range=x,
457+
y_range=(0, 1.05 * y.max()),
458+
width=670,
459+
height=400,
460+
title=title,
461+
tooltips=[("# PRs merged", "@x"), ("# contributors", "@top")],
462+
)
463+
p.vbar(x=x, top=y, width=0.8)
464+
p.xaxis.axis_label = "# Merged PRs per user"
465+
p.yaxis.axis_label = "# of unique contributors with N PRs merged"
466+
show(p)
467+
```
468+
469+
#### Pony factor
470+
471+
Another way to look at these data is in terms of the
472+
[pony factor](https://ke4qqq.wordpress.com/2015/02/08/pony-factor-math/),
473+
described as:
474+
475+
> The minimum number of contributors whose total contribution constitutes a
476+
> majority of the contributions.
477+
478+
For this analysis, we will consider merged PRs as the metric for contribution.
479+
Considering all merged PRs over the lifetime of the project, the pony factor
480+
is: {glue:text}`pony_factor`.
481+
482+
% TODO: pandas-ify to improve sorting
483+
484+
```{code-cell} ipython3
485+
---
486+
tags: [hide-input]
487+
---
488+
# Sort by number of merged PRs in descending order
489+
num_merged_prs_per_author.sort()
490+
num_merged_prs_per_author = num_merged_prs_per_author[::-1]
491+
492+
num_merged_prs = num_merged_prs_per_author.sum()
493+
pf_thresh = 0.5
494+
pony_factor = np.searchsorted(
495+
np.cumsum(num_merged_prs_per_author), num_merged_prs * pf_thresh, side="right"
496+
) + 1  # searchsorted returns a 0-based index; +1 converts it to a contributor count
497+
498+
fig, ax = plt.subplots()
499+
ax.plot(np.cumsum(num_merged_prs_per_author), ".")
500+
ax.set_title("How the pony factor is calculated")
501+
ax.set_xlabel("# unique contributors")
502+
ax.set_xscale("log")
503+
ax.set_ylabel("Cumulative sum of merged PRs / contributor")
504+
ax.hlines(
505+
xmin=0,
506+
xmax=len(contributions_by_author),
507+
y=num_merged_prs * pf_thresh,
508+
color="tab:green",
509+
label=f"Pony factor threshold = {100 * pf_thresh:1.0f}%",
510+
)
511+
ax.legend();
512+
513+
glue("pony_factor", pony_factor)
514+
```
515+
516+
% TODO: Add:
517+
% - Augmented pony factor (only consider contributors active in a time window)
518+
% - pony factor over time, e.g. yearly bins
519+
520+
[^master_to_main]: i.e. ``master`` or ``main``.
521+
[^only_active]: This only includes PRs from users with an active GitHub account.
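
The pony factor calculation in the diff above can be checked on a small, made-up data set. The following is a minimal sketch, not the code from the commit: the helper name `pony_factor` and the example counts are hypothetical, and it assumes a non-empty list of per-contributor merged-PR counts. It reads "majority" as strictly more than the threshold fraction of all contributions.

```python
import numpy as np


def pony_factor(contribution_counts, threshold=0.5):
    """Smallest number of contributors whose combined contributions exceed
    `threshold` of the total (hypothetical helper, assumes non-empty input)."""
    # Sort contributors from most to least prolific
    counts = np.sort(np.asarray(contribution_counts))[::-1]
    cumulative = np.cumsum(counts)
    # side="right" gives the first index whose cumulative sum is strictly
    # greater than the threshold; +1 turns that 0-based index into a count.
    idx = np.searchsorted(cumulative, threshold * cumulative[-1], side="right")
    return int(idx) + 1


# Made-up numbers of merged PRs per contributor (80 in total)
example_counts = [1, 40, 2, 25, 1, 10, 1]
result = pony_factor(example_counts)
print(result)
```

In this example the top contributor alone accounts for exactly half of the 80 merged PRs, which is not a majority, so a second contributor is needed to cross the threshold.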
