At the time of writing I'm using Linux Mint 22.1 Cinnamon. PySpark 4.0.0 has just been released, and I work in either VS Code or PyCharm. The installation is not as straightforward as it seems.
Update: If you want to use pip instead of pipx, see my next post.
If you tried following the standard install (pip install pyspark), you probably received the following error message:
error: externally-managed-environment
× This environment is externally managed
╰─> To install Python packages system-wide, try apt install
python3-xyz, where xyz is the package you are trying to
install.
If you wish to install a non-Debian-packaged Python package,
create a virtual environment using python3 -m venv path/to/venv.
Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
sure you have python3-full installed.
If you wish to install a non-Debian packaged Python application,
it may be easiest to use pipx install xyz, which will manage a
virtual environment for you. Make sure you have pipx installed.
See /usr/share/doc/python3.12/README.venv for more information.
note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.
Long story short, Windows developers (and users) may remember "DLL hell", where one installer wrecks the installation environment created by another installer. Linux users will use both apt and pip for installing packages, which in the past caused issues if they overwrote or removed dependencies managed by the other installer. There's a good explanation with more links in the first answer for pip install -r requirements.txt is failing: "This environment is externally managed".
The easiest way to install PySpark (or any pip-installed package) on Mint is to install and use pipx, which creates a venv (virtual environment) for the PySpark installation. Two advantages of using pipx are that the virtual environment is created for you automatically, and you don't have to activate the environment to use it. Instead of activating, you just need to configure the Python interpreter in your IDE, which involves rooting around in a hidden folder. Do this once, though, and you're good to go.
Pipx can be installed either using Mint's package manager, or using the Ubuntu commands at https://pipx.pypa.io/stable/installation/#on-linux.
Once pipx is installed, you just need to run pipx install pyspark to install PySpark in its own virtual environment. The nice thing about venvs is you can have multiple versions living side-by-side, so you can test your code against both 3.5.5 and 4.0.0.
You don't activate pipx-created venvs like you do traditional ones. Instead, you configure the venv's Python as the interpreter in your IDE. In PyCharm, create a new project and add a .py file with the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark.version)
You should see a warning about pyspark being a missing dependency, which it is as far as the default Python interpreter is concerned. In PyCharm, look in the lower right-hand corner for the Python version and click it. Then choose to add a local interpreter.
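If you want to confirm the warning outside the IDE, a small sketch like this asks the current interpreter whether it can find a module without actually importing it. The helper name has_module is just for illustration.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if the running interpreter can locate a top-level module."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Expect False under the default interpreter and True under the pipx venv's.
    print("pyspark importable:", has_module("pyspark"))
```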
We're going to use an existing interpreter, and navigate to the pyspark venv.
Pipx creates virtual environments in a hidden folder (similar to how Windows uses AppData). Click the "eyeball" button to show hidden files.
Once the hidden folders are displayed, drill down to /home/{username}/.local/share/pipx/venvs/pyspark/bin, select the proper version of the Python interpreter, and OK back to the top. PySpark should no longer be seen as a missing dependency, and you can run the little sample. If the PySpark version prints out, it's go time! If you need to, you can switch back to the default interpreter by clicking the selected one in the lower right-hand corner.
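If you'd rather paste the interpreter path than click through hidden folders, this sketch builds it for you. It assumes the folder layout described above. It also honors the PIPX_HOME environment variable in case your pipx keeps its venvs somewhere else. The function name is hypothetical.

```python
import os
from pathlib import Path


def pipx_venv_python(package: str = "pyspark") -> Path:
    """Build the expected path to the interpreter inside a pipx venv."""
    default_home = Path.home() / ".local" / "share" / "pipx"
    pipx_home = Path(os.environ.get("PIPX_HOME", default_home))
    return pipx_home / "venvs" / package / "bin" / "python"


if __name__ == "__main__":
    candidate = pipx_venv_python()
    print(candidate)
    print("Exists:", candidate.exists())
```

If "Exists" comes back False, your pipx may use a different location, so check its own settings before trusting this path.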
For VS Code, the process is similar. Open (or create) the Python file we used above in VS Code. Again, the current interpreter is shown in the lower right corner. The file's language indicator sits right next to it, so it can get a little confusing.
We'll choose to enter an interpreter path, and then browse to the venv.
A file window will open, and once again we need to show hidden files. Right-click in the whitespace, and select "Show hidden files". Then, navigate to the same folder as before, select the python interpreter and OK back to the top.
You should now be able to run our sample code. If the version prints out, it's go time!
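If the version doesn't print and you're not sure which interpreter the IDE actually picked, a quick diagnostic along these lines can help. It's a sketch using only the standard library, and the helper names are mine. It reports the interpreter path, whether it is running inside a venv, and which pyspark version (if any) is installed for it, all without starting Spark.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """True when the interpreter is running from a venv rather than the system install."""
    return sys.prefix != sys.base_prefix


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("In a venv:", in_virtualenv())
    print("pyspark version:", installed_version("pyspark"))
```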