FLINK-22519

Have python-archives also take tar.gz

Details

    Description

      python-archives currently accepts only zip archives.

      In our use case, we want to package the whole conda environment into python-archives, similar to how the docs suggest packaging a venv (Python virtual environment). Because we use PyFlink for ML, the environment inevitably contains a few large dependencies (tensorflow, torch, pyarrow) as well as many small ones.

      This pattern is not friendly to zip. According to the post, zip compresses each file independently, so it performs poorly on a large number of small files. In contrast, tar simply bundles all files into a tarball, and gzip can then be applied to the whole tarball to achieve a smaller size. This may explain why the official packaging tool, conda pack, produces tar.gz by default, even though zip is available as an option.
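      The size argument above can be checked with a small standard-library experiment. This is only a sketch under stated assumptions: the generated files are synthetic stand-ins for the many small, similar source files found in a Python environment, and the helper names (make_small_files, zip_size, tar_gz_size) are hypothetical. Real environments also contain large binaries, so the actual ratio will differ.

```python
import io
import tarfile
import zipfile


def make_small_files(count=500):
    """Return (name, bytes) pairs mimicking many small, similar source files."""
    files = []
    for i in range(count):
        body = (
            "# module {0}\n"
            "import os\nimport sys\n\n"
            "def handler_{0}(value):\n"
            "    return value + {0}\n"
        ).format(i)
        files.append(("pkg/module_{}.py".format(i), body.encode("utf-8")))
    return files


def zip_size(files):
    """Size of a zip archive: every entry is deflated on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files:
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_gz_size(files):
    """Size of a tar.gz archive: gzip runs over the whole tar stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


if __name__ == "__main__":
    sample = make_small_files()
    print("zip bytes:   ", zip_size(sample))
    print("tar.gz bytes:", tar_gz_size(sample))
```

      Because gzip sees the whole tarball as one stream, it can reuse patterns that repeat across files, while zip pays per-entry header overhead and restarts compression for every file.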

      To test this, I ran an experiment on my laptop with a conda environment. My OS: macOS 10.15.7

      1. Create an environment.yaml as well as a requirements.txt
      2. Run `conda env create -f environment.yaml` to create the conda env
      3. Run `conda pack` to produce a tar.gz
      4. Run `conda pack -o featflow-ml-env.zip` to produce a zip

      More details:

      environment.yaml

      name: featflow-ml-env
      channels:
        - pytorch
        - conda-forge
        - defaults
      dependencies:
        - python=3.7
        - pytorch=1.8.0
        - scikit-learn=0.23.2
        - pip
        - pip:
          - -r file:requirements.txt
      

      requirements.txt

      apache-flink==1.12.0
      deepctr-torch==0.2.6
      black==20.8b1
      confluent-kafka==1.6.0
      pytest==6.2.2
      testcontainers==3.4.0
      kafka-python==2.0.2
      

       
      End result: the tar.gz is 854M, the zip is 1.6G

      So, long story short, python-archives only supports zip, while zip is not a good choice for packaging ML libs. Let's change this by adding tar.gz support to python-archives.

      The change would work like this: in ProcessPythonEnvironmentManager.java, check the archive's suffix. If it is tar.gz, unarchive it using a gzip decompressor.
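      The suffix check described above can be sketched as follows. This is a Python illustration of the dispatch logic only, not the Java change to ProcessPythonEnvironmentManager.java; the function name extract_archive and its return values are hypothetical.

```python
import os
import tarfile
import zipfile


def extract_archive(archive_path, target_dir):
    """Hypothetical helper: choose an extractor based on the file suffix.

    Returns the detected format as a string. Only use this on trusted
    archives, since extractall() does not sanitize member paths on every
    Python version.
    """
    os.makedirs(target_dir, exist_ok=True)
    lower = archive_path.lower()
    if lower.endswith(".tar.gz"):
        # "r:gz" decompresses the gzip layer, then reads the tar entries.
        with tarfile.open(archive_path, mode="r:gz") as tf:
            tf.extractall(target_dir)
        return "tar.gz"
    if lower.endswith(".zip"):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(target_dir)
        return "zip"
    raise ValueError("unsupported archive suffix: " + archive_path)
```

      Checking the suffix keeps the existing zip path unchanged and adds tar.gz as a second branch; any other suffix is rejected rather than guessed at.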

            People

              Yik San Chan