Introduction to Private Serverless Models
Private Model Deployments
fal is a Generative Media Cloud with a model marketplace, private model deployments, and a training acceleration platform. This section covers our private model deployment system, which enables you to host your custom models and workflows on our infrastructure—the same infrastructure that powers our own models.
Key Features
- A unified framework for running, deploying, and productionizing your ML models
- Access to tens of thousands of GPUs, with dynamic scale up/down policies
- Full observability into requests, responses, and latencies (including custom metrics)
- Native HTTP and WebSocket clients that can be used for both fal-provided models and your own models
- Access to fal’s Inference Engine for accelerating your models/workflows
- And much more
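For example, the native Python client (installed separately, e.g. pip install fal-client) can call a marketplace model in a few lines. This is a hedged sketch: the model id and the exact result fields are assumptions that depend on the model you call.

# Requires `pip install fal-client` and a FAL_KEY in your environment.
# The model id and result fields below are illustrative assumptions.
import fal_client

result = fal_client.subscribe(
    "fal-ai/flux/schnell",
    arguments={"prompt": "A beautiful image of a cat"},
)
print(result["images"][0]["url"])

The same client should also work against your own private deployments once they are live.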
Getting Started
Installation
If you have access to our private model deployment offering, create a fresh virtual environment (we strongly recommend Python 3.11, but other versions are supported) and install the fal package:
pip install --upgrade fal
Authentication
Log in to either your personal account or any team you're a member of. Be careful to select the right entity: private beta access is usually enabled only for your team account, not your personal account.
fal auth login
When prompted, select your team:
If browser didn't open automatically, on your computer or mobile device navigate to [...]
Confirm it shows the following code: [...]
✓ Authenticated successfully, welcome!
Please choose a team account to use or leave blank to use your personal account:[team1/team2/team3]: team1
Confirm you selected the right team:
fal auth whoami
Running Your First Model
Every deployment in fal is a subclass of fal.App with one or more methods decorated with @fal.endpoint. For simple models or workflows with only one endpoint, that endpoint is generally the root (/).
Here’s an example application that runs FLUX.1-schnell, an open-source text-to-image model:
import fal
from pydantic import BaseModel, Field
from fal.toolkit import Image


class Input(BaseModel):
    prompt: str = Field(
        description="The prompt to generate an image from",
        examples=["A beautiful image of a cat"],
    )


class Output(BaseModel):
    image: Image


class MyApp(fal.App, keep_alive=300, name="my-demo-app"):
    machine_type = "GPU-H100"
    requirements = [
        "hf-transfer==0.1.9",
        "diffusers[torch]==0.32.2",
        "transformers[sentencepiece]==4.51.0",
        "accelerate==1.6.0",
    ]

    def setup(self):
        # Enable HF Transfer for faster downloads
        import os

        os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

        import torch
        from diffusers import AutoPipelineForText2Image

        # Load any model you want, we'll use Flux.1-schnell
        # Huggingface models will be automatically downloaded to
        # the persistent storage of your account
        self.pipe = AutoPipelineForText2Image.from_pretrained(
            "black-forest-labs/FLUX.1-schnell",
            torch_dtype=torch.bfloat16,
        ).to("cuda")

        # Warmup the model before the first request
        self.warmup()

    def warmup(self):
        self.pipe("A beautiful image of a cat", num_inference_steps=4)

    @fal.endpoint("/")
    def run(self, request: Input) -> Output:
        result = self.pipe(request.prompt, num_inference_steps=4)
        image = Image.from_pil(result.images[0])
        return Output(image=image)
The application is divided into four parts:
- I/O: Defines inputs for the inference process (which in this case just takes a prompt), and the outputs (using the special fal.toolkit.Image, which automatically uploads your images to fal's CDN)
- App definition: Each app is completely separated from your local computer. You need to precisely define what dependencies your app requires to run in the cloud. In this case, it's diffusers and other related packages.
- setup() function: Before an app starts serving requests, it will always run the user-defined setup() function. This is where you download your models, load them to GPU memory, and run warmups.
- @fal.endpoint("/"): Using the I/O definitions and the pipeline loaded in setup(), this is where you implement the inference process. In this example, it simply calls the pipeline with the user's prompt and wraps the created PIL image with Image.from_pil() to upload it to the CDN.
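Apps are not limited to a single endpoint: a fal.App can expose several @fal.endpoint routes that share the pipelines loaded in setup(). Below is a minimal sketch; the route paths, models, and class names are illustrative and not part of the example above.

import fal
from pydantic import BaseModel


class TextInput(BaseModel):
    prompt: str


class TextOutput(BaseModel):
    text: str


class MultiEndpointApp(fal.App, keep_alive=300, name="multi-endpoint-demo"):
    machine_type = "GPU-H100"

    def setup(self):
        # Load any pipelines shared by all endpoints here.
        ...

    @fal.endpoint("/")
    def generate(self, request: TextInput) -> TextOutput:
        # Root endpoint: would normally call a loaded pipeline.
        return TextOutput(text=f"generated for: {request.prompt}")

    @fal.endpoint("/uppercase")
    def uppercase(self, request: TextInput) -> TextOutput:
        # A second route served by the same app and the same runners.
        return TextOutput(text=request.prompt.upper())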
Testing Your Application
To run your app in a temporary fal run session and test that everything works:
fal run example.py::MyApp
Notes:
- During the first run or after any dependency change, we’ll build a Python environment from scratch. This process might take a couple of minutes, but as long as you don’t change the environment definition, it will reuse the same pre-built environment.
- The initial start will also download the models, but they'll be saved to a persistent location (under /data) where they will always be available. This means the next time you run this app, the model won't have to be downloaded again.
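Hugging Face downloads are handled for you, but the same /data volume can hold any custom artifacts. A minimal sketch, assuming a hypothetical weights URL and path (neither comes from the example above):

from pathlib import Path
import urllib.request

# Hypothetical location under the persistent /data volume.
WEIGHTS_PATH = Path("/data/checkpoints/custom-model.safetensors")
WEIGHTS_URL = "https://example.com/custom-model.safetensors"  # placeholder URL


def ensure_weights() -> Path:
    # Download once; later cold starts find the file already in /data.
    if not WEIGHTS_PATH.exists():
        WEIGHTS_PATH.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(WEIGHTS_URL, WEIGHTS_PATH)
    return WEIGHTS_PATH

Calling ensure_weights() from setup() would then be a no-op after the first runner has populated the path.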
This command will print two links for you to interact with your app and start streaming the logs:
2025-04-07 21:37:41.001 [info ] Access your exposed service at http://fal.run/your-user/051cf487-8f52-43dc-b793-354507637dd0
2025-04-07 21:37:41.001 [info ] Access the playground at http://fal.ai/dashboard/sdk/your-user/051cf487-8f52-43dc-b793-354507637dd0
==> Running
INFO:     Started server process [38]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
The first link is an HTTP proxy that goes directly to your app, allowing you to call any endpoint without authentication as long as fal run is active:
curl $FAL_RUN_URL -H 'content-type: application/json' -H 'accept: application/json, */*;q=0.5' -d '{"prompt":"A cat"}'
Alternatively, for root endpoints, you can visit the auto-generated fal playground to interact with your model through the web UI. This option requires authentication (you need to be using your team account if you started the fal run through a team account).
Deploying & Productionizing
Once you feel the model is ready for production and you don't want to maintain a fal run session, you can deploy the model. Deployment provides a persistent URL that either routes requests to an existing runner or starts a new one if none are available:
fal deploy example.py::MyApp --auth=private
Registered a new revision for function 'my-demo-app' (revision='5b23e1b1-af88-4ab0-aebc-415b2b1e34b4').
Playground: http://fal.ai/models/fal-ai/my-demo-app/
Endpoints: http://fal.run/fal-ai/my-demo-app/
Once deployed, you can go to the playground or make an authenticated HTTP call to use your model.
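An authenticated call from Python might look like the sketch below. It assumes FAL_KEY holds a key from your team account, that the endpoint URL matches the one printed by fal deploy, and that the response JSON exposes the uploaded image's URL via the Output model shown earlier; adjust all of these to your actual deployment.

import os
import requests

# Placeholder URL: use the endpoint printed by `fal deploy` for your account.
response = requests.post(
    "https://fal.run/your-user/my-demo-app/",
    headers={"Authorization": f"Key {os.environ['FAL_KEY']}"},
    json={"prompt": "A beautiful image of a cat"},
    timeout=120,
)
response.raise_for_status()
print(response.json()["image"]["url"])  # assumes the Output model shown earlier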
Notes:
- Since this is the first invocation and you didn't set any minimum number of runners (min_concurrency), it will start a new container and load the model. Once your request is finished, the runner will remain active for keep_alive seconds, so subsequent requests within the keep_alive window will be instant.
- This behavior can be configured with the fal app scale command. See the documentation for more details.
Monitoring & Analytics
For production deployments, you'll want more observability over your deployed app. The Analytics page lets you see all requests and identify errors or unusually slow requests, along with their payloads.
You can see logs attached to a specific request in the request’s detail page, or view global logs for all runners in the logs page.
Advanced Configuration
keep_alive
The keep_alive setting enables the server to continue running even when there are no active requests. By setting this parameter, you ensure that if you hit the same application within the specified time frame, you can avoid any startup overhead.
keep_alive is measured in seconds. In the example below, the application will keep running for at least 300 seconds after the last request:
class MyApp(fal.App, keep_alive=300):
    ...
Min/Max Concurrency
fal applications have a simple managed autoscaling system. You can configure the autoscaling behavior through min_concurrency and max_concurrency:
class MyApp(fal.App, keep_alive=300, min_concurrency=1, max_concurrency=5):
    ...
- min_concurrency: Indicates the number of replicas the system should maintain when there are no requests.
- max_concurrency: Indicates the maximum number of replicas the system should have. Once this limit is reached, all subsequent requests are placed in a managed queue.
FAQ - First Asked Questions
- How can I use local files or my repository with fal?
  If your project is already a Python package (e.g., has __init__.py and can be imported outside of the repo), you should be able to use it as is (import it at the top level and call the relevant functions). Note that if you're using any external dependencies in your project, you'll also need to include them in the requirements field.
- I already have a Dockerfile with all my dependencies. Can I use it?
  Yes! You can either pass us a pre-built Docker image as the base or your Dockerfile, and we'll build it for you. Note that your image's Python version and your local virtual environment's Python version need to match.
- How can I store my secrets?
  Use fal secrets set and you can read them as environment variables from your code (see the sketch after this list)! Docs
- Do you offer persistent storage? How can I use it?
  Anything written to /data will be available to all replicas and will be persisted. Be careful when storing many small files, as this will increase latencies (prefer large single blobs whenever possible, like model weights). Docs
- How can I scale my app?
  The platform offers extensive ways to configure scaling. The simplest approach is to increase the minimum and maximum number of runners via fal app scale $app --min-concurrency=N --max-concurrency=N. Check our docs to tune these variables and learn about other concepts (decaying keep-alive, multiplexing, concurrency buffers, etc.). Docs
- What is the best way to deploy from CI?
  You can create an ADMIN scoped key in your team account and use fal deploy with FAL_KEY set. Make sure to check our testing system as well. Docs
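Regarding the secrets question above, the runner exposes each secret to your code as an environment variable. A minimal sketch, assuming a secret named MY_API_TOKEN was created beforehand with fal secrets set (the secret name and app name are hypothetical):

import os

import fal


class SecretDemoApp(fal.App, keep_alive=60, name="secret-demo"):
    def setup(self):
        # MY_API_TOKEN is a hypothetical secret name created with `fal secrets set`.
        # Reading it at startup makes the app fail fast if it was never configured.
        self.api_token = os.environ["MY_API_TOKEN"]

    # ... endpoints as in the earlier examples, using self.api_token as needed.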