
Python adventures: Finding unused dependencies

This article is about an issue I’ve run into a couple of times when working on medium-to-large Python projects and having to deploy them. It happens quite often that the dependencies are messy: whoever is, or was, working on the project installed a bunch of libraries because they needed them for a new feature at the time. But when deployment comes, we want the environments or containers to be lean, for multiple reasons, build time and pipeline latency among them. After some months of work on a project, we are often left with many dependencies and no clear idea which ones are still needed.

For that reason, I tried to build a small Python tool that lets me go inside a project and print out all the unused dependencies, which then become candidates for removal unless advised otherwise.

If that sounds interesting to you and you’d like a look at the code, you can check it out on GitHub. In short, it gives you a small CLI tool and lets you do:

python cli.py -f . -r

inside a project, to go through the whole project recursively and check which of your dependencies are actually used or not. Then, you’ll get messages printed in the console:

'pandas' is probably unused.
'scipy' is probably unused.

In the next parts of the post, I’ll break down the functions and functionality that make this possible; we’ll talk about naming things, abstract syntax trees (ASTs) and regular expressions.


What’s in a name

First, I think it’s important to know a bit about the packaging system in Python. Regardless of your package management strategy (conda, pip, poetry…), when you’re in a virtual environment, installed packages will (almost every time) end up in the lib/python/site-packages/ directory. So anytime you run conda install, pip install or poetry add from within a virtual environment, this is the place where you’ll find the package’s files and metadata. There are other ways to install packages, notably by cloning a repository, but we’ll focus on what most people do when they’re working with Python, which is using the tools I just mentioned.
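If you want to see exactly where that is for the environment you’re in, the standard library can tell you:

import site

# Print the site-packages directories of the currently active environment.
print(site.getsitepackages())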

This is where the naming part comes in. You might think that when you install a package named some-package, it becomes accessible as some-package as an import in your Python code. This is trivially wrong because Python does not allow hyphens in identifiers, so you should expect it to be some_package. Less obviously, it is also possible that the package provides an entirely different name for imports, which can’t be guessed from the package name alone. Notable examples are scikit-learn being available as sklearn, beautifulsoup4 being available as bs4, or pyyaml being yaml. If you check out the source code of those projects, you’ll find that the name of the directory containing the code is the name used in imports, while the name of the project is the one used for installs. In short, the import name is the actual name of the directory containing the package’s code, and the name retrieved with package managers is whatever the project is called and published as. This is all well and good, but when we interact with dependencies, the project name is used, not the import name. So we need a strategy to link the two together.
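To make the mismatch concrete:

# Installed with:  pip install scikit-learn
import sklearn  # ...but imported under a different name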


When you install packages in the ways we mentioned, you get the package’s code but also its metadata: a package called stuff will also provide a stuff-<version>.egg-info or stuff-<version>.dist-info directory. Those directories sometimes contain a file called top_level.txt that gives the import names (yes, there are sometimes multiple!) for a package. A simple heuristic got me quite far: if the metadata provides a top_level.txt, use it; if that file does not exist, then assume the package name and import name are the same.

Armed with this knowledge, we can build the first part of the tool, which reads that information into a Python data structure: a mapping of each package name to its import names. Python’s standard library does a lot of the heavy lifting, since it provides tools to find where packages are installed as well as to manipulate paths and files. In the function below, I walk through the directories where dependencies are installed, grab the ones ending with -info and look for top_level.txt. Depending on its presence, I apply the heuristic, and then we have a neat mapping of dependencies and import names.
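A minimal sketch of that function, assuming the metadata directories sit next to the packages in site-packages (names and details here are illustrative, not the exact code from the repository):

import site
from pathlib import Path


def get_dependency_mapping() -> dict[str, set[str]]:
    """Map each installed package name to its possible import names."""
    mapping: dict[str, set[str]] = {}
    for site_dir in site.getsitepackages():
        # Metadata directories look like "stuff-1.0.dist-info" or "stuff-1.0.egg-info".
        for info_dir in Path(site_dir).glob("*-info"):
            # Normalise the distribution name the way package managers spell it.
            package_name = info_dir.name.split("-")[0].replace("_", "-").lower()
            top_level = info_dir / "top_level.txt"
            if top_level.is_file():
                # The file lists one import name per line (there can be several).
                import_names = set(top_level.read_text().split())
            else:
                # Heuristic: assume the import name is the package name itself.
                import_names = {package_name.replace("-", "_")}
            mapping[package_name] = import_names
    return mapping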

The resulting data structure will look something like this (entries here are illustrative):
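# Package names mapped to their possible import names.
{
    "scikit-learn": {"sklearn"},
    "beautifulsoup4": {"bs4"},
    "pyyaml": {"yaml", "_yaml"},
    "pandas": {"pandas"},
}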

Now that we have a better handle on the dependencies and their names, we still need to find which are actually used in our own code, which is detailed in the next part.

Playing in the trees

I think there are a couple of ways to go about finding imports; they will all involve reading the Python files in our project, of course. We could:

  • Iterate over the lines in a file and try to find the ones containing import statements.
  • Use regular expressions.
  • Use a method aware of the code’s structure.

I think the first and second methods have their merits, but since they don’t ‘understand’ code structure, they are also quick to break down. Think about what happens when imports aren’t at the top level but inside functions, or when there are multi-line imports. Dealing with strings and regular expressions quickly becomes a nightmare.

With that in mind, we can use Python’s own parser, which has an intimate knowledge of the language’s grammar. The structure of a whole program can be described by a data structure called an abstract syntax tree (AST), for which Python also has a module, called ast. There are specific nodes of interest in a tree, namely Import and ImportFrom; they contain all the information we need, so we’ll use just those to collect the names of the imports in a single set for the whole project.

In the function below, we go through each file’s AST and walk down the tree to find the nodes related to imports.
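A sketch of what that function could look like (the name is illustrative), using the ast module from the standard library:

import ast
from pathlib import Path


def get_imports(filenames: list[str]) -> set[str]:
    """Collect the top-level module names imported across the given files."""
    imports: set[str] = set()
    for filename in filenames:
        tree = ast.parse(Path(filename).read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                # "import numpy.linalg as la" -> keep only "numpy"
                imports.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                # "from pandas.io import sql" -> keep only "pandas"
                imports.add(node.module.split(".")[0])
    return imports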

Now we have a tool that gives us all the imports for as many files as we want, and from the previous section we know the dependencies. Our next job is to compare the dependencies with the imports.

There is a somewhat large caveat though. The logic above has no way to differentiate between top-level dependencies and those dependencies’ own dependencies. Any installed package will be added to the mapping, and comparing the imports with the mapping will not yield a very useful report, because none (or few) of the installed packages’ dependencies are imported in our own code. To remedy that issue, we can use dependency files such as requirements.txt, env.yml or pyproject.toml. Since those files document the actual, direct dependencies of a project, we can use them to filter down the dependency mapping and get a more realistic output.

As you might have guessed, the next section talks about parsing and using those files.

Sometimes regular expressions are good enough

Python’s ecosystem can go a bit wild about defining dependencies so I’ll go with a large hammer, called regular expressions, to read dependency files. Here’s the function that does just that and gets packages' names from dependency files.
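Here is a rough sketch of that idea, limited for brevity to pyproject.toml and requirements-style text files (the YAML handling described below is left out, and all names are illustrative):

import re
import tomllib  # standard library from Python 3.11
from pathlib import Path

# Everything up to the first version, extras or comparison character is the name.
NAME_PATTERN = re.compile(r"^\s*([A-Za-z0-9._-]+)")


def get_declared_dependencies(depfiles: list[str]) -> set[str]:
    """Return the package names declared in the given dependency files."""
    declared: set[str] = set()
    for depfile in depfiles:
        path = Path(depfile)
        if path.suffix == ".toml":
            data = tomllib.loads(path.read_text())
            specs = data.get("project", {}).get("dependencies", [])
        else:
            # requirements.txt style: one requirement per line, comments allowed.
            specs = [
                line for line in path.read_text().splitlines()
                if line.strip() and not line.strip().startswith("#")
            ]
        for spec in specs:
            match = NAME_PATTERN.match(spec)
            if match:
                # Normalise the same way as the dependency mapping.
                declared.add(match.group(1).replace("_", "-").lower())
    return declared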

There’s a lot going on here, since multiple file formats have to be dealt with differently; even worse, some tools use the same file format but with a different structure. So even if this function looks a bit stuffy, it mostly contains file reading and regular expression logic. You’ll notice that the TOML files have their own function to parse them. The text files are parsed roughly, only taking the names of the packages on the left-hand side of symbols like [=<>. The YAML files are a bit trickier because multiple tools can be called from within the file: it’s possible both to add dependencies and to call pip from inside the YAML. For each file type, we get the dependency name and add it to a set of all dependencies. Note that this function does not cover all possible edge cases; it’s merely a starting point.

Putting it all together

We have everything to build a function bringing all the functionalities together.
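A sketch of that final function, reusing the illustrative helpers from the previous sections:

def find_unused(
    dependencies: dict[str, set[str]],
    imports: set[str],
    declared: set[str] | None = None,
) -> list[str]:
    """Return the packages whose possible import names never appear in the code."""
    unused = []
    for package, import_names in dependencies.items():
        # When a dependency file was given, only consider the packages it declares.
        if declared is not None and package not in declared:
            continue
        if not import_names & imports:
            unused.append(package)
    return unused


# For example:
# for package in find_unused(get_dependency_mapping(), get_imports(python_files)):
#     print(f"'{package}' is probably unused.")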

 The function iterates over all the dependencies with the names being the keys and the set of possible import names being the values. If none of the possible import names are found in the actual imports, then we know that the package is probably unused. The function works without any dependency file and will flag all installed packages but it is advised to use it with a dependency file that will narrow down the actual unused packages.

Making a CLI

Finally, we can build a small CLI that makes the package’s functionality a bit easier to use. It uses argparse and constrains the number of arguments accepted by the more important parameters.
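A sketch of what that CLI could look like, wired up to the illustrative helpers above (the flag names follow the ones shown at the start of the post):

import argparse
import glob


def main() -> None:
    parser = argparse.ArgumentParser(description="Find probably-unused dependencies.")
    parser.add_argument("-f", "--filenames", nargs="+", required=True,
                        help="Directories to scan for imports, e.g. '.'")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="Recurse into directories when collecting Python files.")
    parser.add_argument("--depfiles", nargs="*", default=[],
                        help="Dependency files such as requirements.txt or pyproject.toml.")
    args = parser.parse_args()

    python_files: list[str] = []
    for name in args.filenames:
        pattern = f"{name}/**/*.py" if args.recursive else f"{name}/*.py"
        # glob patterns skip hidden directories such as .venv by default.
        python_files.extend(glob.glob(pattern, recursive=args.recursive))

    declared = get_declared_dependencies(args.depfiles) if args.depfiles else None
    unused = find_unused(get_dependency_mapping(), get_imports(python_files), declared)
    for package in unused:
        print(f"'{package}' is probably unused.")


if __name__ == "__main__":
    main()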

You might wonder why I decided to use glob.glob and not Path.glob to get the files; it is because, as of Python 3.11, glob.glob gives explicit control over hidden directories and files (they are ignored by default), which is handy because none of the files in the project’s .venv folder should be accessed.

 Hopefully, this has been an interesting delve into Python’s functionalities.


Finding Python package dependencies

To find a Python package’s dependencies, you can parse the package’s PyPI JSON endpoint. These endpoints can be found at:

https://pypi.org/pypi/<package name>/<package version>/json

You can parse the JSON at this URL with the requests package, as follows:

import requests

package_json = requests.get('https://pypi.org/pypi/seaborn/0.11.2/json', verify=False).json()
package_json['info']['requires_dist']

If you run this for the Seaborn package, you’ll get:

['numpy (>=1.15)', 'scipy (>=1.0)', 'pandas (>=0.23)', 'matplotlib (>=2.2)']

But there’s an issue here. Some of Seaborn’s dependencies have dependencies themselves. They aren’t listed here. Luckily, there is a solution. Read on!

Find package dependencies with Pipdeptree

If you want to find all (including nested) dependencies: use pipdeptree, which is a command-line tool with many options for listing and visualizing dependencies.

pip install pipdeptree
pipdeptree

Running the pipdeptree command in your terminal will print all of your packages’ dependencies. In the example below, I took a snapshot of my pipdeptree output to show what Seaborn’s dependencies look like.

seaborn==0.11.2
  - matplotlib [required: >=2.2, installed: 3.5.1]
    - cycler [required: >=0.10, installed: 0.11.0]
    - fonttools [required: >=4.22.0, installed: 4.31.2]
    - kiwisolver [required: >=1.0.1, installed: 1.4.1]
    - numpy [required: >=1.17, installed: 1.22.3]
    - packaging [required: >=20.0, installed: 21.3]
      - pyparsing [required: >=2.0.2,!=3.0.5, installed: 3.0.7]
    - pillow [required: >=6.2.0, installed: 9.0.1]
    - pyparsing [required: >=2.2.1, installed: 3.0.7]
    - python-dateutil [required: >=2.7, installed: 2.8.2]
      - six [required: >=1.5, installed: 1.16.0]
  - numpy [required: >=1.15, installed: 1.22.3]
  - pandas [required: >=0.23, installed: 1.4.1]
    - numpy [required: >=1.21.0, installed: 1.22.3]
    - python-dateutil [required: >=2.8.1, installed: 2.8.2]
      - six [required: >=1.5, installed: 1.16.0]
    - pytz [required: >=2020.1, installed: 2022.1]
  - scipy [required: >=1.0, installed: 1.8.0]
    - numpy [required: >=1.17.3,

If you want to get pipdeptree’s output in a text file, use:

pipdeptree >> dependencies.txt

Congratulations, you can now find the dependencies of Python packages.

