
Downloadable Models

Announcement · Feature · Self-Hosting · Training

Starting today, you can download your trained models to host them on your own infrastructure. This opens the door to scenarios where compliance is a concern, or where you have special requirements (latency, throughput, or cost) that you want to control yourself.

How does it work?

When Augento trains your model, it doesn't update the whole model (which would be 32B parameters for Qwen2.5-32B-Instruct). Instead, it trains a small set of additional low-rank weights using a technique called LoRA (Low-Rank Adaptation). This lets us fine-tune your model efficiently, without the compute resources required for full-parameter training.

The result is a small adapter, which can simply be merged with the original model (even on the fly, without a restart) to perform inference.
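If you want a single standalone model rather than a base model plus adapter, you can merge the two with the peft library. Here's a minimal sketch, assuming the base model is Qwen2.5-32B-Instruct and the adapter has been unzipped to ./augento-adapter/ (as described in the walkthrough below):

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, apply the LoRA adapter on top, then fold the
# adapter weights into the base weights to get one merged model.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
model = PeftModel.from_pretrained(base, "./augento-adapter/")
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model/")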

How do I get my adapter?

After you've trained your model, you can download the adapter from the model page:

Download Adapter

Then click on the "Create Download Link" button:

Download Link

Wait a few seconds (we need to prepare the download) and you will get a link to download the adapter, which you can either click directly or copy to your clipboard.

The download link is valid for 1 hour, after which it will expire.

What do I download?

You will get a zip file containing the adapter files.

You need to unzip it to a folder. In this example, we'll use augento-adapter.
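If you'd rather do this from Python, here's a quick sketch, assuming the downloaded archive is called adapter.zip (substitute the actual name of your download):

import zipfile

# Extract the downloaded archive into a folder called augento-adapter
with zipfile.ZipFile("adapter.zip") as archive:
    archive.extractall("augento-adapter")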

It is in the standard Hugging Face adapter format, so you can load it with from_pretrained from the transformers library.

Hardware Requirements

We tried this example on a 24 GB RTX 4090. Please make sure you have a recent version of CUDA installed.

Install dependencies

pip install -U transformers peft bitsandbytes tokenizers

Load the adapter

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./augento-adapter/")
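Pointing from_pretrained at the adapter folder works because transformers (with peft installed) reads the adapter config, fetches the base model the adapter was trained on, and applies the adapter on top. If the base model doesn't fit in your GPU memory at full precision, one option is to load it quantized with bitsandbytes; a rough sketch, assuming you also have accelerate installed for device_map="auto":

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit so a large base model can fit in ~24 GB
# of VRAM, then apply the adapter on top of it.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "./augento-adapter/",
    quantization_config=quant_config,
    device_map="auto",  # needs the accelerate package
)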

Run inference

from transformers import pipeline, AutoTokenizer

# Chat completion
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=AutoTokenizer.from_pretrained("./augento-adapter/"),
)
generator(
    [{"role": "user", "content": "Hello! How are you?"}],
    max_new_tokens=1024,
)

Your model should now output something like this:

[{'generated_text': [{'role': 'user', 'content': 'Hello! How are you?'},
   {'role': 'assistant',
    'content': "Hello! As an AI, I don't have feelings or physical sensations, but I'm here and ready to help you with any questions or tasks you might have. How can I assist you today?"}]}]

Inference

You can use the adapter you just downloaded for inference on your own infrastructure. You can also upload it to the Hugging Face Hub and use one of the inference providers offered there, or run inference yourself with vLLM.
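For vLLM, you don't need to merge the adapter first: vLLM can serve the base model with LoRA adapters attached at request time. A minimal sketch, assuming the base model is Qwen/Qwen2.5-32B-Instruct and the adapter is unzipped to ./augento-adapter/:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Start the base model with LoRA support enabled, then attach the
# downloaded adapter for this generation request.
llm = LLM(model="Qwen/Qwen2.5-32B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["Hello! How are you?"],
    SamplingParams(max_tokens=1024),
    lora_request=LoRARequest("augento-adapter", 1, "./augento-adapter/"),
)
print(outputs[0].outputs[0].text)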