Introduction to Private Serverless Models

Private Model Deployments

fal is a Generative Media Cloud with a model marketplace, private model deployments, and a training acceleration platform. This section covers our private model deployment system, which enables you to host your custom models and workflows on our infrastructure—the same infrastructure that powers our own models.

Key Features

  1. A unified framework for running, deploying, and productionizing your ML models
  2. Access to tens of thousands of GPUs, with dynamic scale up/down policies
  3. Full observability into requests, responses, and latencies (including custom metrics)
  4. Native HTTP and WebSocket clients that can be used for both fal-provided models and your own models
  5. Access to fal’s Inference Engine for accelerating your models/workflows
  6. And much more

Getting Started

Installation

If you have access to our private model deployment offering, create a fresh virtual environment (we strongly recommend Python 3.11, but other versions are supported) and install the fal package:

Terminal window
pip install --upgrade fal

Authentication

Log in to either your personal account or any team you’re a member of. Be careful to select the correct entity: private beta access is typically granted to your team account rather than your personal account.

Terminal window
fal auth login

When prompted, select your team:

If browser didn't open automatically, on your computer or mobile device navigate to [...]
Confirm it shows the following code: [...]
✓ Authenticated successfully, welcome!
Please choose a team account to use or leave blank to use your personal account:
[team1/team2/team3]: team1

Confirm you selected the right team:

Terminal window
fal auth whoami

Running Your First Model

Every deployment in fal is a subclass of fal.App that exposes one or more methods decorated with @fal.endpoint. For simple models or workflows with a single endpoint, that endpoint is generally mounted at the root (/) path.

Here’s an example application that runs FLUX.1-schnell, an open-source text-to-image model:

import fal
from pydantic import BaseModel, Field
from fal.toolkit import Image


class Input(BaseModel):
    prompt: str = Field(
        description="The prompt to generate an image from",
        examples=["A beautiful image of a cat"],
    )


class Output(BaseModel):
    image: Image


class MyApp(fal.App, keep_alive=300, name="my-demo-app"):
    machine_type = "GPU-H100"
    requirements = [
        "hf-transfer==0.1.9",
        "diffusers[torch]==0.32.2",
        "transformers[sentencepiece]==4.51.0",
        "accelerate==1.6.0",
    ]

    def setup(self):
        # Enable HF Transfer for faster downloads
        import os

        os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

        import torch
        from diffusers import AutoPipelineForText2Image

        # Load any model you want, we'll use Flux.1-schnell
        # Huggingface models will be automatically downloaded to
        # the persistent storage of your account
        self.pipe = AutoPipelineForText2Image.from_pretrained(
            "black-forest-labs/FLUX.1-schnell",
            torch_dtype=torch.bfloat16,
        ).to("cuda")

        # Warmup the model before the first request
        self.warmup()

    def warmup(self):
        self.pipe("A beautiful image of a cat", num_inference_steps=4)

    @fal.endpoint("/")
    def run(self, request: Input) -> Output:
        result = self.pipe(request.prompt, num_inference_steps=4)
        image = Image.from_pil(result.images[0])
        return Output(image=image)

The application is divided into four parts:

  1. I/O: Defines inputs for the inference process (which in this case just takes a prompt), and the outputs (using the special fal.toolkit.Image which automatically uploads your images to fal’s CDN)
  2. App definition: Each app runs in the cloud, completely isolated from your local machine, so you need to precisely declare the dependencies it requires. In this case, that’s diffusers and related packages.
  3. setup() function: Before an app starts serving requests, it will always run the user-defined setup() function. This is where you download your models, load them to GPU memory, and run warmups.
  4. @fal.endpoint("/"): Using the I/O definitions and the pipeline loaded in setup(), this is where you implement the inference process. In this example, it simply calls the pipeline with the user’s prompt and wraps the created PIL image with Image.from_pil() to upload it to the CDN.

Testing Your Application

To run and test your app locally to ensure everything works:

Terminal window
fal run example.py::MyApp

Notes:

  • During the first run or after any dependency change, we’ll build a Python environment from scratch. This process might take a couple of minutes, but as long as you don’t change the environment definition, subsequent runs will reuse the same pre-built environment.
  • The initial start will also download the models, but they’ll be saved to a persistent location (under /data) where they will always be available. This means the next time you run this app, the model won’t have to be downloaded again.
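
The same /data volume works for weights that don’t come from the Hugging Face Hub. Here is a minimal, hypothetical sketch (the URL, directory, and filename below are placeholders) that downloads a checkpoint once and reuses the persisted copy on later runs:

import os
import urllib.request

# Hypothetical locations on fal's persistent storage; adjust to your model.
WEIGHTS_DIR = "/data/my-models"
WEIGHTS_PATH = os.path.join(WEIGHTS_DIR, "custom-checkpoint.safetensors")
WEIGHTS_URL = "https://example.com/custom-checkpoint.safetensors"  # placeholder

def ensure_weights() -> str:
    """Download the checkpoint only if it is not already persisted under /data."""
    if not os.path.exists(WEIGHTS_PATH):
        os.makedirs(WEIGHTS_DIR, exist_ok=True)
        urllib.request.urlretrieve(WEIGHTS_URL, WEIGHTS_PATH)
    return WEIGHTS_PATH

Calling ensure_weights() from setup() means only the first cold start pays the download cost; because /data is shared across replicas, subsequent runners read the persisted file directly.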

This command will print two links for you to interact with your app and start streaming the logs:

2025-04-07 21:37:41.001 [info ] Access your exposed service at http://fal.run/your-user/051cf487-8f52-43dc-b793-354507637dd0
2025-04-07 21:37:41.001 [info ] Access the playground at http://fal.ai/dashboard/sdk/your-user/051cf487-8f52-43dc-b793-354507637dd0
==> Running
INFO: Started server process [38]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

The first link is an HTTP proxy that goes directly to your app, allowing you to call any endpoint without authentication as long as fal run is active:

Terminal window
curl $FAL_RUN_URL -H 'content-type: application/json' -H 'accept: application/json, */*;q=0.5' -d '{"prompt":"A cat"}'

Alternatively, for root endpoints, you can visit the auto-generated fal playground to interact with your model through the web UI. This option requires authentication (if you started fal run with a team account, you need to be logged in with that same team account).

Deploying & Productionizing

Once you feel the model is ready for production and you don’t want to maintain a fal run session, you can deploy the model. Deployment provides a persistent URL that either routes requests to an existing runner or starts a new one if none are available:

Terminal window
fal deploy example.py::MyApp --auth=private
Registered a new revision for function 'my-demo-app' (revision='5b23e1b1-af88-4ab0-aebc-415b2b1e34b4').
Playground:
http://fal.ai/models/fal-ai/my-demo-app/
Endpoints:
http://fal.run/fal-ai/my-demo-app/

Once deployed, you can go to the playground or make an authenticated HTTP call to use your model.
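
For example, assuming your API key is exported as FAL_KEY (the header format below follows fal’s key-based authentication; double-check it against your account’s key settings):

Terminal window
curl http://fal.run/fal-ai/my-demo-app/ \
  -H "Authorization: Key $FAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A cat"}'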

Note:

  • Since this is the first invocation and you didn’t set a minimum number of runners (min_concurrency), it will start a new container and load the model. Once your request finishes, the runner remains active for keep_alive seconds, so subsequent requests within the keep_alive window are served without any startup overhead.
  • This behavior can be configured with the fal app scale command. See the documentation for more details.
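
For example, to keep at least one warm runner for the app deployed above (the app is referenced by its name here; see the scaling docs for the full set of flags):

Terminal window
fal app scale my-demo-app --min-concurrency=1 --max-concurrency=5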

Monitoring & Analytics

For production deployments, you’ll want more observability over your deployed app. The Analytics page lets you see all requests and identify errors or slow requests, along with their payloads.

You can see logs attached to a specific request in the request’s detail page, or view global logs for all runners in the logs page.

Advanced Configuration

keep_alive

The keep_alive setting keeps the server running even when there are no active requests. By setting this parameter, you ensure that requests hitting the same application within the specified time frame avoid any startup overhead.

keep_alive is measured in seconds. In the example below, the application will keep running for at least 300 seconds after the last request:

class MyApp(fal.App, keep_alive=300):
    ...

Min/Max Concurrency

fal applications have a simple managed autoscaling system. You can configure the autoscaling behavior through min_concurrency and max_concurrency:

class MyApp(fal.App, keep_alive=300, min_concurrency=1, max_concurrency=5):
    ...

  • min_concurrency: Indicates the number of replicas the system should maintain when there are no requests
  • max_concurrency: Indicates the maximum number of replicas the system should have. Once this limit is reached, all subsequent requests are placed in a managed queue

FAQ - First Asked Questions

  1. How can I use local files or my repository with fal?

    If your project is already a Python package (e.g., has __init__.py and can be imported outside of the repo), you should be able to use it as is (import it at the top level and call the relevant functions). Note that if you’re using any external dependencies in your project, you’ll also need to include them in the requirements field. A minimal sketch is shown after this FAQ.

  2. I already have a Dockerfile with all my dependencies. Can I use it?

    Yes! You can either provide a pre-built Docker image to use as the base, or give us your Dockerfile and we’ll build it for you. Note that your image’s Python version needs to match your local virtual environment’s Python version.

  3. How can I store my secrets?

    Use fal secrets set and you can read them as environment variables from your code! Docs

  4. Do you offer persistent storage? How can I use it?

    Anything written to /data will be available to all replicas and will be persisted. Be careful when storing many small files, as this will increase latencies (prefer large single blobs whenever possible, like model weights). Docs

  5. How can I scale my app?

    The platform offers extensive ways to configure scaling. The simplest approach is to increase the minimum and maximum number of runners via fal app scale $app --min-concurrency=N --max-concurrency=N. Check our docs to tune these variables and learn about other concepts (decaying keep-alive, multiplexing, concurrency buffers, etc.). Docs

  6. What is the best way to deploy from CI?

    You can create an ADMIN scoped key in your team account and use fal deploy with FAL_KEY set. Make sure to check our testing system as well. Docs
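
For the CI question above, a deploy step might look like this (the key value is a placeholder for an ADMIN scoped key from your team account):

Terminal window
FAL_KEY="<your-admin-scoped-key>" fal deploy example.py::MyApp --auth=private

And here is the sketch referenced in question 1. In it, my_project is a hypothetical local package (a directory with __init__.py next to example.py), load_model() and generate() are placeholder functions assumed to return a model object and a PIL image, and Input/Output are the pydantic models defined in the example earlier on this page.

import fal
from fal.toolkit import Image

# Hypothetical local package living next to this file; importing it at the
# top level is all that's needed to use it inside the app.
from my_project.pipeline import load_model, generate

# Input and Output are the pydantic models from the example above.

class MyApp(fal.App, keep_alive=300, name="my-demo-app"):
    machine_type = "GPU-H100"
    # Third-party packages used inside my_project still have to be listed
    # here, since the cloud environment is built from this list.
    requirements = [
        "diffusers[torch]==0.32.2",
        "transformers[sentencepiece]==4.51.0",
    ]

    def setup(self):
        self.model = load_model()

    @fal.endpoint("/")
    def run(self, request: Input) -> Output:
        pil_image = generate(self.model, request.prompt)  # assumed to return a PIL image
        return Output(image=Image.from_pil(pil_image))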