@@ -23,6 +23,7 @@ import functools
2323import datetime
2424from dateutil.parser import isoparse
2525import warnings
26+ from collections import defaultdict
2627
2728import numpy as np
2829import matplotlib.pyplot as plt
@@ -47,7 +48,7 @@ output_notebook()
4748
4849%TODO improve handling of datetimes (super annoying)
4950
50- A snapshot of the development on the NumPy project since {glue:text}`query_date`
51+ A snapshot of the development on the NumPy project.
5152
5253## Issues
5354
@@ -259,12 +260,29 @@ tags: [hide-input]
259260with open("../_data/prs.json", "r") as fh:
260261    prs = [item["node"] for item in json.loads(fh.read())]
261262
263+ ### Filters
264+
262265# Only look at PRs to the main development branch - ignore backports, gh-pages,
263266# etc.
264267default_branches = {"main", "master"} # Account for default branch update
265268prs = [pr for pr in prs if pr["baseRefName"] in default_branches]
269+
270+ # Drop data where PR author is unknown (e.g. github account no longer exists)
271+ prs = [pr for pr in prs if pr["author"]] # Failed author query results in None
272+
273+ # Filter out PRs by bots
274+ bot_filter = {"dependabot-preview"}
275+ prs = [pr for pr in prs if pr["author"]["login"] not in bot_filter]
266276```
267277
278+ The following filters are applied to the PRs before the analysis below:
279+ - Only PRs to the default development branch (e.g. ``main``)[^master_to_main]
280+   are considered.
281+ - Only PRs from users with *active* GitHub accounts are considered. For example,
282+   if a user opened a Pull Request in 2016, but then deleted their GitHub account
283+   in 2017, then this PR is excluded from the analysis.
284+ - PRs opened by dependabot are excluded.
285+
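As a rough illustration of how the three filters combine, here is a minimal sketch on made-up records. The helper name `filter_prs` and the sample data are hypothetical; only the keys `baseRefName`, `author`, and `login`, and the values `main`, `master`, and `dependabot-preview` come from the notebook code.

```python
def filter_prs(prs, default_branches=frozenset({"main", "master"}),
               bot_logins=frozenset({"dependabot-preview"})):
    """Apply the three filters in order (hypothetical helper)."""
    kept = []
    for pr in prs:
        # 1. Keep only PRs that target a default development branch.
        if pr["baseRefName"] not in default_branches:
            continue
        # 2. Drop PRs whose author lookup failed (author is None).
        if not pr["author"]:
            continue
        # 3. Drop PRs opened by known bot accounts.
        if pr["author"]["login"] in bot_logins:
            continue
        kept.append(pr)
    return kept


# Made-up records, one per filter case.
sample_prs = [
    {"baseRefName": "main", "author": {"login": "alice"}},
    {"baseRefName": "maintenance/1.19.x", "author": {"login": "bob"}},
    {"baseRefName": "master", "author": None},
    {"baseRefName": "main", "author": {"login": "dependabot-preview"}},
]
filtered = filter_prs(sample_prs)
print([pr["author"]["login"] for pr in filtered])
```

The author check has to run before the bot check, because indexing `pr["author"]["login"]` on a `None` author would raise a `TypeError`.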
268286### Merged PRs over time
269287
270288A look at merged PRs over time.
@@ -392,3 +410,112 @@ p.yaxis.axis_label = "PR lifetime (hours)"
392410p.scatter(x=num_participants, y=lifetimes.astype(int), size=9, alpha=0.4)
393411show(p)
394412```
413+
414+ ### Where contributions come from
415+
416+ There have been a total of {glue:text}`num_merged_prs_with_known_authors`
417+ merged PRs[^only_active] submitted by {glue:text}`num_unique_authors_of_merged_prs`
418+ unique authors. {glue:text}`num_flyby` of these authors are "fly-by" contributors, i.e.
419+ users who have had exactly one PR merged into the project (to date).
420+
421+
422+ ``` {code-cell} ipython3
423+ ---
424+ tags: [hide-input]
425+ ---
426+
427+ # Remap PRs by author
428+ contributions_by_author = defaultdict(list)
429+ for pr in merged_prs:
430+     author = pr["author"]["login"]
431+     contributions_by_author[author].append(pr)
432+
433+ num_merged_prs_per_author = np.array(
434+     [len(prs) for prs in contributions_by_author.values()]
435+ )
436+
437+ num_flybys = np.sum(num_merged_prs_per_author == 1)
438+
439+ glue("num_merged_prs_with_known_authors", len(merged_prs))
440+ glue("num_unique_authors_of_merged_prs", len(contributions_by_author))
441+ glue("num_flyby", percent_val(num_flybys, len(num_merged_prs_per_author)))
442+ ```
443+
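To make the fly-by figure concrete, the sketch below repeats the same counting on a small made-up list of author logins. The names and data are hypothetical, and it uses `collections.Counter` rather than the notebook's `defaultdict`. Note that the fraction is taken over unique authors, not over PRs.

```python
from collections import Counter

# Made-up author logins, one entry per merged PR.
merged_pr_authors = ["alice", "bob", "alice", "carol", "alice", "dave"]

# Number of merged PRs per author.
prs_per_author = Counter(merged_pr_authors)

# Fly-by contributors are authors with exactly one merged PR.
flyby_authors = sorted(a for a, n in prs_per_author.items() if n == 1)

# Share of unique authors who are fly-by contributors.
flyby_fraction = len(flyby_authors) / len(prs_per_author)

print(flyby_authors, f"{flyby_fraction:.0%}")
```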
444+ ``` {code-cell} ipython3
445+ ---
446+ tags: [hide-input]
447+ ---
448+
449+ title = "Distribution of number of merged PRs per contributor"
450+
451+ x = ["1", "2", "3", "4", "5", "6 - 10", "11 - 20", "21 - 50", "> 50"]
452+ bedges = np.array([0, 1, 2, 3, 4, 5, 10, 20, 50, sum(num_merged_prs_per_author)]) + 0.5
453+ y, _ = np.histogram(num_merged_prs_per_author, bins=bedges)
454+
455+ p = figure(
456+     x_range=x,
457+     y_range=(0, 1.05 * y.max()),
458+     width=670,
459+     height=400,
460+     title=title,
461+     tooltips=[("# PRs merged", "@x"), ("# contributors", "@top")],
462+ )
463+ p.vbar(x=x, top=y, width=0.8)
464+ p.xaxis.axis_label = "# Merged PRs per user"
465+ p.yaxis.axis_label = "# of unique contributors with N PRs merged"
466+ show(p)
467+ ```
468+
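The half-integer bin edges are what make each bar cover a whole range of integer counts. For instance, the bin between 10.5 and 20.5 holds authors with 11 through 20 merged PRs. A minimal sketch on made-up counts follows; the data are hypothetical, and the last edge uses the maximum count, which is an assumption made here for illustration.

```python
import numpy as np

# Made-up merged-PR counts per author, chosen to sit on the bin boundaries.
counts = np.array([1, 1, 2, 5, 6, 10, 11, 20, 21, 50, 51, 300])

labels = ["1", "2", "3", "4", "5", "6 - 10", "11 - 20", "21 - 50", "> 50"]

# Shifting integer edges by 0.5 puts every integer count strictly inside a bin.
edges = np.array([0, 1, 2, 3, 4, 5, 10, 20, 50, counts.max()]) + 0.5

hist, _ = np.histogram(counts, bins=edges)
for label, n in zip(labels, hist):
    print(f"{label:>7}: {n}")
```

Every count lands in exactly one bin, so the histogram totals match the number of authors.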
469+ #### Pony factor
470+
471+ Another way to look at these data is in terms of the
472+ [pony factor](https://ke4qqq.wordpress.com/2015/02/08/pony-factor-math/),
473+ described as:
474+
475+ > The minimum number of contributors whose total contribution constitutes a
476+ > majority of the contributions.
477+
478+ For this analysis, we will consider merged PRs as the metric for contribution.
479+ Considering all merged PRs over the lifetime of the project, the pony factor
480+ is: {glue:text}`pony_factor`.
481+
482+ % TODO: pandas-ify to improve sorting
483+
484+ ``` {code-cell} ipython3
485+ ---
486+ tags: [hide-input]
487+ ---
488+ # Sort by number of merged PRs in descending order
489+ num_merged_prs_per_author.sort()
490+ num_merged_prs_per_author = num_merged_prs_per_author[::-1]
491+
492+ num_merged_prs = num_merged_prs_per_author.sum()
493+ pf_thresh = 0.5
494+ pony_factor = 1 + np.searchsorted(  # 0-based index -> number of contributors
495+     np.cumsum(num_merged_prs_per_author), num_merged_prs * pf_thresh, side="right"
496+ )
497+
498+ fig, ax = plt.subplots()
499+ ax.plot(np.arange(1, num_merged_prs_per_author.size + 1), np.cumsum(num_merged_prs_per_author), ".")
500+ ax.set_title("How the pony factor is calculated")
501+ ax.set_xlabel("# unique contributors")
502+ ax.set_xscale("log")
503+ ax.set_ylabel("Cumulative sum of merged PRs / contributor")
504+ ax.hlines(
505+     xmin=1,
506+     xmax=len(contributions_by_author),
507+     y=num_merged_prs * pf_thresh,
508+     color="tab:green",
509+     label=f"Pony factor threshold = {100 * pf_thresh:1.0f}%",
510+ )
511+ ax.legend();
512+
513+ glue("pony_factor", pony_factor)
514+ ```
515+
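A small worked example may help in checking the pony factor calculation by hand. The sketch below is not the notebook's code: the function name and sample data are hypothetical, and reading "majority" as strictly more than half of all merged PRs is an assumption.

```python
import numpy as np


def pony_factor(prs_per_author, threshold=0.5):
    """Smallest number of top contributors whose merged PRs exceed
    `threshold` of the total (hypothetical helper)."""
    # Sort contribution counts in descending order.
    counts = np.sort(np.asarray(prs_per_author))[::-1]
    cumulative = np.cumsum(counts)
    # side="right" finds the first cumulative sum strictly above the target.
    idx = np.searchsorted(cumulative, threshold * counts.sum(), side="right")
    # searchsorted returns a 0-based index; add 1 to get a contributor count.
    return int(idx) + 1


# One dominant contributor: 60 of 100 PRs is already a majority.
print(pony_factor([60, 25, 10, 5]))

# The top two reach exactly 50 of 100, so a third contributor is needed.
print(pony_factor([30, 20, 20, 15, 15]))
```

In the second case the cumulative sums are 30, 50, 70, 85, 100. Since 50 is not strictly more than half, the threshold is first exceeded at the third contributor.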
516+ % TODO: Add:
517+ % - Augmented pony factor (only consider contributors active in a time window)
518+ % - pony factor over time, e.g. yearly bins
519+
520+ [^master_to_main]: i.e. ``master`` or ``main``.
521+ [^only_active]: This only includes PRs from users with an active GitHub account.