Monday, 7 June 2010

Analyzing PyPI packages

PyPI, which stands for Python Package Index, is a global repository for Python
packages. Every time you need a Python tool or library, you can simply type
easy_install mypackage and have it downloaded and
installed for you. It is also a great source when trying to investigate current
practices in the Python world.


There are a couple of difficulties when analyzing PyPI. First, it is a moving target.
Since I first ran the download script (3 days ago), it has grown by 20 new
packages, so please bear in mind that this information won't be very exact. Still, it
provides a nice overview. Second, not all packages are hosted on PyPI. In quite
a lot of cases, we only get a link to the actual download source.
This increases the chance of a host being unavailable, which causes the download to fail. Third,
PyPI packages are terribly diverse. In order to analyze them in a timely manner, I
picked only the ones that could be downloaded as either tarballs or zips. This
reduced the sample by a quarter (from 10112 to 7625), which I believe is still
representative enough.

usage

Most of the packages (96%) used The rest either simply didn't use it
or used a non-standard directory layout (187 and 47 packages, respectively). Out of users, setuptools was more than three times more popular than standard
distutils. 73 packages couldn't be identified as using either of these, which
is mostly caused by custom setup function wrappers (see 4Suite for an example of
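
A minimal sketch of how a package could be classified as a setuptools or distutils user. It assumes the build script is a file named setup.py whose source has already been read into a string, and it simply looks for import statements. The function name and the rule that setuptools wins when both are imported are illustrative assumptions, not necessarily what the script behind these numbers did.

```python
import re

# Matches "import setuptools", "from distutils.core import setup", etc.
_IMPORT_RE = re.compile(
    r"^\s*(?:from|import)\s+(setuptools|distutils)\b", re.MULTILINE
)


def classify_setup_source(source):
    """Return 'setuptools', 'distutils' or 'unknown' for a setup script.

    Hypothetical helper: a script importing both (a common
    try/except fallback pattern) is counted as a setuptools user.
    """
    tools = set(_IMPORT_RE.findall(source))
    if "setuptools" in tools:
        return "setuptools"
    if "distutils" in tools:
        return "distutils"
    # Custom wrappers around setup() end up here.
    return "unknown"
```

A script that imports its setup function from a custom wrapper module falls into the "unknown" bucket, which matches the kind of package the paragraph above says could not be identified.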

Test runners

I was curious how people run their tests, so I identified several ways it could
be done:
  1. using a top-level shell script: 20
  2. using a top-level python script: 326
  3. using setuptools' test command: 961

Note: these stats don't include another popular way of running tests, used by
Django apps.
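
The three categories above could be detected roughly as follows. This is only a sketch under stated assumptions: a "top-level script" is taken to be a file in the package root whose name contains "test" and ends in .sh or .py, and use of the setuptools test command is inferred from a test_suite keyword in the setup script source. The function and its matching rules are hypothetical.

```python
import re

# test_suite is the setup() keyword read by the setuptools test command.
_TEST_SUITE_RE = re.compile(r"\btest_suite\s*=")


def detect_test_runners(top_level_files, setup_source=""):
    """Return the set of test-runner styles a package appears to use.

    top_level_files: names of files directly in the package root.
    setup_source: text of the setup script, if there is one.
    """
    runners = set()
    for name in top_level_files:
        lowered = name.lower()
        if "test" not in lowered:
            continue
        if lowered.endswith(".sh"):
            runners.add("shell script")
        elif lowered.endswith(".py"):
            runners.add("python script")
    if _TEST_SUITE_RE.search(setup_source):
        runners.add("setuptools test command")
    return runners
```

A package can land in more than one category, so the per-category counts do not have to add up to the number of packages with tests.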

There were 1048 packages with a top-level directory whose name contains the string "test",
among which the most popular variations were unsurprisingly "test" (477) and
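
A sketch of how top-level test directories could be found from an archive listing. It assumes the usual source-archive layout with a single root folder (for example "pkg-1.0/") and takes the member names as returned by tarfile's getnames() or zipfile's namelist(). The helper name is hypothetical, and empty directories are not detected because a directory is recognized only by having something inside it.

```python
from collections import Counter


def top_level_test_dirs(member_names):
    """Return directories directly under the archive root that contain 'test'."""
    found = set()
    for name in member_names:
        parts = [part for part in name.split("/") if part]
        # parts[0] is the root folder, parts[1] a top-level entry;
        # a third component proves that parts[1] is a directory.
        if len(parts) >= 3 and "test" in parts[1].lower():
            found.add(parts[1])
    return found


def count_variations(listings):
    """Count directory-name variations across many package listings."""
    counts = Counter()
    for member_names in listings:
        counts.update(top_level_test_dirs(member_names))
    return counts
```

Only directories at the top level are counted here, so a test directory nested inside the package itself is not picked up.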


Michael Foord said...

I think Zope and Plone badly skew any analysis of PyPI like this - simply by virtue of having been split into about a kabillion different packages (all of which use setuptools).

I'd be interested in the same analysis, but skipping zope / plone stuff. I think we'd see that setuptools usage is much lower amongst the other packages (although still significant of course).

Konrad said...

Zope itself is split into <100 packages, so it doesn't make a big difference, and Plone, for some reason, isn't included at all.

What *does* make a difference, however, is the huge number of packages targeting Zope. Judging by the data, they make up more or less one third of all setuptools users. It looks like there is a convention in the Zope community to split applications into separate PyPI packages (that's what many Zope users tend to do, at least).

On the other hand, there's quite a lot of Django setuptools users as well (~300).

Michael Foord said...

Interesting - thanks. So even without Zope / Plone / Django setuptools is still pretty dominant.

The "browse PyPI page" shows the following numbers for Zope, Plone and Django related projects:

Zope2 (531)
Zope3 (915)
Plone (1136)
Django (554)

This is out of a total of over 7000 packages - but 1136 targeting Plone is still quite a number! (Obviously there will be some overlap between Zope2, Zope3 and Plone - but still much higher than Django. You're right that there is a strong convention of splitting Zope apps into multiple PyPI packages.)

Michael Foord said...

As for tests - I usually put my tests inside a 'test' directory in my package. Using test discovery from unittest2 (or nose or py.test) there is no need to provide a top level script to run them. From my reading of your post you were only looking for top-level directories (?) so would miss projects that follow this practise.

Rok Garbas said...

i would actually agree with the analysis konrad made of setuptools usage. and yes, setuptools is the de facto tool to use now for packaging. probably even better would be distribute, which is like setuptools with more patches applied.

but for best practices to be, i would also consider looking into the future of python packaging with distutils2, since lately there is quite a big movement in this area, especially after tarek took over maintaining it.

and for the zope and plone part. you should only count out the packages that start with and, since these are the packages which are intended to be used in a zope or plone app. all others were just developed by plone/zope developers and are just using plone/zope namespaces, but are not made for plone/zope only, so you shouldn't count them in i think, right?

Konrad said...

@Rok In fact, distutils2 is the original reason I ran these stats. I am taking part in GSOC this year, helping Tarek move it forward (along with 5 other students).

Tarek said...

@Rok: you have to count all plone.* and zope.* packages in fact, because if I want to have my code running in a plone app, I am required to use setuptools.

But with PEP 345 (dependencies declarations) and PEP 382 (namespace packages) being implemented, this won't be needed anymore.