license: other
license_name: deepseek-license
license_link: LICENSE
tags:
  - Axolotl
  - Deepspeed
datasets:
  - GusPuffy/python-decompiler-37-0.7-train


[🏠Sentient Simulations] | [Discord] | [Patreon]


Sentient Simulations AI Python 3.7 Decompiler - 6.7b - v0.9

1. Introduction

The Sentient Simulations AI Python Decompiler is a fine-tune of deepseek-ai/deepseek-coder-6.7b-base for the specific task of decompiling Python 3.7 bytecode back to its original Python source code.

2. Data Preparation

The training data for the Sentient Simulations AI Python Decompiler was built from Python 3.7 source code that was compiled to bytecode. The bytecode was used as the input and the source code as the output, teaching the model to generate the original source code from Python bytecode. Below are the steps to prepare the data.

  1. Grab a ton of Python code, or use something like The Stack v2, and compile it using the version of Python you want to target
  2. Throw out any code that doesn't compile with that version of Python
  3. Remove all comments from the code
  4. Format all the code using Black for consistency
  5. Format the bytecode in a way that reduces tokens and is easier for the AI to read - I tried with custom tokens initially but I got inconsistent results
  6. Generate input/output pairs for the training data
  7. Pack the samples with Axolotl sample packing at a constant context of 16k tokens
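
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used to build the dataset: the function names `strip_comments` and `make_pair` are hypothetical, the Black formatting and sample packing steps are left out, and a plain `dis` listing stands in for the token-reduced bytecode format, which this README does not specify.

```python
import dis
import io
import tokenize


def strip_comments(source: str) -> str:
    """Drop comment tokens and rebuild the source from what remains."""
    tokens = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type != tokenize.COMMENT
    ]
    return tokenize.untokenize(tokens)


def make_pair(source: str, filename: str = "<sample>"):
    """Return (bytecode_text, source_text), or None if the code does not compile.

    Hypothetical helper: the real input format differs from raw dis output.
    """
    try:
        cleaned = strip_comments(source)
        code = compile(cleaned, filename, "exec")
    except (SyntaxError, ValueError, tokenize.TokenError):
        # Step 2: throw out code that does not compile on this Python version.
        return None
    buffer = io.StringIO()
    dis.dis(code, file=buffer)  # disassemble into text instead of stdout
    return buffer.getvalue(), cleaned
```

Because bytecode is specific to the interpreter version, a sketch like this would need to run under Python 3.7 to produce Python 3.7 bytecode.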

4. Training

The model was trained for 4 days on three RTX 3090 GPUs using DeepSpeed ZeRO-3 Offload at 16k context.

5. Prerequisites

  1. Create a Python 3.7 environment to get the bytecode
  2. Create a Python 3.10 environment to run the decompiler
  3. Download the Python files and the GGUF file
conda create -n pydecompiler-37 python=3.7 -y
conda create -n pydecompiler-310 python=3.10 -y
conda activate pydecompiler-310
pip install huggingface-hub
mkdir pydecompiler
cd pydecompiler
huggingface-cli download GusPuffy/sentient-simulations-pydecompiler-3.7-6.7b-v0.9 --local-dir . --local-dir-use-symlinks False --include "*.py"
huggingface-cli download GusPuffy/sentient-simulations-pydecompiler-3.7-6.7b-v0.9-GGUF --local-dir . --local-dir-use-symlinks False --include "sentient-simulations-pydecompiler-3.7-6.7b-v0.9-q8_0.gguf"
  4. Install llama cpp. Make sure to use the prefix during install for whichever backend you want, whether you have GPUs or only want to use the CPU

6. Test Example

  1. Convert a Python file to source and bytecode using Python 3.7
conda activate pydecompiler-37
python bytecode.py bytecode.py > bytecode-decompiled.pycb
  2. The bytecode has been written to bytecode-decompiled.pycb. Now switch to the Python 3.10 environment to run the decompiler on the test example
conda activate pydecompiler-310
python decompile.py bytecode-decompiled.pycb
  3. Compare the AI-decompiled code in 'bytecode-decompiled.py' with the actual contents of 'bytecode.py'
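
One way to do the comparison in the last step is with `difflib` from the standard library. The sketch below is only a suggestion and not part of the downloaded scripts; `compare_sources` is a hypothetical helper that takes the two file contents as strings.

```python
import difflib


def compare_sources(original: str, decompiled: str):
    """Return a line-based similarity ratio in [0, 1] and a unified diff."""
    original_lines = original.splitlines(keepends=True)
    decompiled_lines = decompiled.splitlines(keepends=True)
    ratio = difflib.SequenceMatcher(None, original_lines, decompiled_lines).ratio()
    diff = "".join(
        difflib.unified_diff(
            original_lines,
            decompiled_lines,
            fromfile="bytecode.py",
            tofile="bytecode-decompiled.py",
        )
    )
    return ratio, diff
```

Expect some differences even for a good result: comments are not stored in bytecode, so they cannot be recovered, and formatting may not match the original file.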

7. Decompilation of a Directory of .pyc files

If you have a bunch of files you want to decompile, you can use the following steps to decompile the entire directory.

  1. Convert the .pyc files to bytecode strings using Python 3.7
conda activate pydecompiler-37
python convert_pyc_to_bytecode.py directory_with_files
  2. Decompile the bytecode strings back to Python source code (This will take some time depending on how many files there are)
conda activate pydecompiler-310
python decompile.py directory_with_files
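
For readers curious why the first step needs a Python 3.7 environment: a .pyc file holds a marshalled code object, and both the marshal format and the bytecode are tied to the interpreter version. The sketch below shows the general idea using only the standard library. It is not the contents of convert_pyc_to_bytecode.py, and `disassemble_pyc` is a hypothetical name.

```python
import dis
import io
import marshal

# Header length for .pyc files written by Python 3.7 and later (PEP 552).
PYC_HEADER_SIZE = 16


def disassemble_pyc(path: str) -> str:
    """Load the code object from a .pyc file and return its dis listing.

    Must run under the same Python version that produced the file.
    """
    with open(path, "rb") as handle:
        handle.read(PYC_HEADER_SIZE)  # skip magic number, flags, source metadata
        code = marshal.load(handle)
    buffer = io.StringIO()
    dis.dis(code, file=buffer)  # write the listing to a string, not stdout
    return buffer.getvalue()
```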

Note that files over 15k tokens are skipped. Files over 10k tokens will most likely be truncated due to the context limit.
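
As a rough illustration of these limits, the check below sorts a file into one of three buckets from its token count. The exact values are assumptions ("15k" and "10k" are read as 15,000 and 10,000), the token count is expected to come from the model's tokenizer, and `classify_by_token_count` is a hypothetical helper rather than something in the downloaded scripts.

```python
SKIP_LIMIT = 15_000        # assumed meaning of "15k": files above this are skipped
TRUNCATION_LIMIT = 10_000  # assumed meaning of "10k": output likely truncated above this


def classify_by_token_count(token_count: int) -> str:
    """Describe what is likely to happen to a file with this many tokens."""
    if token_count > SKIP_LIMIT:
        return "skipped"
    if token_count > TRUNCATION_LIMIT:
        return "likely truncated"
    return "ok"
```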

The decompiled source code ends with a note indicating whether the AI decided the output was complete (stop) or the context limit was reached (length):

# Finish Reason: stop
# Finish Reason: length
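
When decompiling many files, it can help to collect these notes automatically and flag the files that ended with `length`. The following is a minimal sketch that assumes the note is the last non-blank line and is written exactly as shown above; `read_finish_reason` is a hypothetical helper.

```python
import re

# Matches a trailing comment such as "# Finish Reason: stop".
FINISH_REASON = re.compile(r"^#\s*Finish Reason:\s*(\w+)\s*$")


def read_finish_reason(decompiled_source: str):
    """Return the finish reason from the last non-blank line, or None if absent."""
    for line in reversed(decompiled_source.splitlines()):
        stripped = line.strip()
        if not stripped:
            continue  # ignore trailing blank lines
        match = FINISH_REASON.match(stripped)
        return match.group(1) if match else None
    return None
```

A result of `length` suggests the output was cut off and the file should be reviewed by hand.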

8. Next Iteration

The dataset has shown pretty amazing results for decompiling files under 16k context using only a 6.7b model.

I would like to train a larger context model or the 33b version of DeepSeek Coder.

Let me know if you have compute available and you are interested in training a longer context version of this tool!

9. License

This is a fine-tune of DeepSeek Coder, so check out their license, LICENSE-MODEL, for more details.

10. Contact

If you have any questions, please raise an issue or find me on Discord.