Databricks Auto Loader in Practice

A practical guide to Databricks Auto Loader: file discovery modes, file events, schema evolution and _rescued_data, Trigger.AvailableNow, and Lakeflow streaming tables.

This post was written by an engineer at QueryPlane. QueryPlane is an app builder for your database: bring your own postgres db and you can create interactive applications to share with other developers, coworkers or even your customers. If you’re interested in trying it out, get started here.

Almost every Databricks account that ingests files from cloud object storage eventually converges on Auto Loader: the cloudFiles source in Structured Streaming that incrementally discovers new files in S3, ADLS, or GCS and writes them to a Delta table. The premise is straightforward — point a stream at a prefix, get exactly-once ingestion with schema inference and evolution baked in — but in practice almost every team trips over the same handful of options: directory listing vs file notifications, how cloudFiles.schemaEvolutionMode interacts with _rescued_data, when Trigger.AvailableNow is cheaper than continuous streaming, and how Auto Loader behaves inside Lakeflow Declarative Pipelines (the product formerly known as Delta Live Tables).

Databricks repackaged the ingestion stack as Lakeflow Connect at the 2025 Data + AI Summit, and Auto Loader is now the file-ingestion path under that umbrella. Two things changed in practice: file events went GA (the new “managed” notification mode that removes the directory-listing vs SNS/SQS tradeoff), and the legacy cloudFiles.useIncrementalListing flag was deprecated and defaulted to false in DBR 17.3 LTS. Existing pipelines keep working, but the recommended configuration for any new bronze-layer ingest is different from what most teams shipped in 2023.

This post walks through the options that matter for a production deployment, the schema-evolution semantics that determine whether records land in the right column or in _rescued_data, how to tune file discovery for high-volume prefixes, and the patterns that show up when Auto Loader is wrapped in a Lakeflow streaming table.

In this post, we’ll cover:

Where Auto Loader fits — Lakeflow Connect, COPY INTO, and managed connectors compared
The cloudFiles source — minimal configuration and the four things every stream needs
Schema inference and evolution — schemaLocation, schemaHints, the five schemaEvolutionMode values, and _rescued_data
File discovery — directory listing vs file notification vs file events, and the useIncrementalListing deprecation
Triggered vs continuous — Trigger.AvailableNow, rate limits, and backfills
Production patterns — checkpointing, monitoring, partition columns, dead-letter handling
Auto Loader inside Lakeflow Declarative Pipelines — streaming tables and the parts you no longer configure
Pitfalls — schema location collisions, evolution loops, notification quota limits, and more

Where Auto Loader fits

Lakeflow Connect is Databricks’ umbrella term for ingestion, and it groups three layers: fully-managed connectors (Salesforce, Workday, SQL Server, ServiceNow), standard connectors for cloud storage and message buses, and Auto Loader as the canonical file-ingestion path within the standard tier. The flow chart Databricks recommends — start with the most managed layer and drop down only when needed — places Auto Loader as the right answer when files arrive in object storage and a managed connector doesn’t cover the format.

The two alternatives for cloud-storage files are COPY INTO and one-shot reads. COPY INTO is a transactional batch command that’s idempotent against the file list it’s been given and writes to a Delta table in one shot. It’s the right tool for ad-hoc backfills, well-bounded historical loads, and any case where you want a single SQL statement instead of a streaming job. It’s not the right tool for a recurring ingest where new files arrive continuously, because every invocation re-scans the directory to find unprocessed files: fine for thousands of files, painful for millions. One-shot reads (spark.read.format("json").load(...)) are the right tool for exploratory work and a poor fit for any pipeline, because they keep no incremental state and every run re-reads the whole prefix.

Auto Loader is the steady-state answer: incremental discovery via metadata stored in the stream’s checkpoint, schema inference and evolution as first-class features, and an integration with Structured Streaming that gives you exactly-once semantics through the standard checkpoint mechanism. The file metadata is persisted in a RocksDB-backed key-value store inside the checkpoint location, so resuming a stream after a restart never re-processes a file and never misses one.

The cloudFiles source

The minimum viable Auto Loader stream needs four pieces: a source format, a schema location, a checkpoint location, and a target table. In PySpark:

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/events")
  .load("s3://acme-events/raw/")
  .writeStream
  .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/events")
  .trigger(availableNow=True)
  .toTable("main.bronze.events"))

That’s it. Auto Loader will sample up to 50 GB or 1,000 files (whichever comes first) on first run to infer a schema, write the inferred schema to schemaLocation/_schemas/v0, then start streaming. The default trigger is continuous micro-batches; availableNow=True (recommended for batch-style runs — more on that below) processes whatever is currently in the prefix and shuts the stream down.

Supported formats are JSON, CSV, Parquet, Avro, ORC, XML, TEXT, and BINARYFILE — cloudFiles.format accepts any of them, and the inference + evolution behavior is consistent across formats with one exception: untyped formats (JSON, CSV, XML) default every inferred column to STRING unless you opt in to type inference with .option("cloudFiles.inferColumnTypes", "true"). We’ve seen teams burn an afternoon wondering why their numeric columns are strings; the option is opt-in to keep behavior deterministic across schema versions.

The schemaLocation and checkpointLocation should be separate paths, both inside a Unity Catalog volume in any new deployment. Putting both at the same path used to be common, and it still shows up in error messages and old examples, but conflating them makes it impossible to clear schema state without clearing the checkpoint, which means re-processing every file. Keep them apart.
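One way to keep the two paths apart without thinking about it is to derive both from a common parent. The sketch below is only an illustration: the helper name stream_paths and the _schemas / _checkpoints folder names are our assumptions, modelled on the example above rather than on any Databricks convention.

```python
def stream_paths(volume_root: str, stream_name: str) -> dict:
    """Return separate schema and checkpoint paths under one parent volume."""
    root = volume_root.rstrip("/")
    return {
        "schema_location": f"{root}/_schemas/{stream_name}",
        "checkpoint_location": f"{root}/_checkpoints/{stream_name}",
    }


paths = stream_paths("/Volumes/main/bronze", "events")
print(paths["schema_location"])      # /Volumes/main/bronze/_schemas/events
print(paths["checkpoint_location"])  # /Volumes/main/bronze/_checkpoints/events
```

The two values would then be passed to cloudFiles.schemaLocation and checkpointLocation respectively, so either one can be deleted without touching the other.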

Schema inference and evolution

The _rescued_data column is the single most important behavior to internalize. By default, every Auto Loader stream adds a _rescued_data column to the target schema, and any field in an incoming record that doesn’t match the current schema — wrong type, unknown column, casing mismatch — lands there as a JSON blob along with the source file path. The stream never fails on a bad record; it rescues it.

That’s the safety net. The evolution mode controls what happens to new columns specifically: a field that appears in incoming records but isn’t in the schema yet.

The cloudFiles.schemaEvolutionMode option takes five values:

addNewColumns (default in current DBR) — Stream throws UnknownFieldException, the orchestration layer restarts it with the updated schema, and the new column appears. Subsequent records populate it normally. This is the right behavior for most bronze-layer ingestion and works correctly inside Lakeflow Declarative Pipelines, which handles the restart automatically.
addNewColumnsWithTypeWidening — Same as above, plus compatible type widening (int → long, float → double). Useful when upstream producers occasionally upcast a column.
rescue — Schema never evolves. New columns land in _rescued_data as JSON, original columns continue as inferred. Right answer for “I have a tightly-controlled downstream schema and want to detect drift in _rescued_data rather than absorbing it”.
failOnNewColumns — Stream halts permanently when a new column appears, requires manual schema update. Use this on tables governed by a contract.
none — Quietest mode. New columns are simply ignored. Almost never the right answer because you lose data silently.
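To make the differences between the modes concrete, here is a toy model in plain Python. It is not how Auto Loader is implemented: the function apply_mode, the UnknownFieldError class, and the evolved_schema attribute are all hypothetical, and the model only covers brand-new fields, not type mismatches or type widening.

```python
import json


class UnknownFieldError(Exception):
    """Stand-in for the UnknownFieldException the stream throws (toy model)."""


def apply_mode(schema: set, record: dict, mode: str):
    """Return (row, schema_after) for one record, or raise on a schema change."""
    known = {k: v for k, v in record.items() if k in schema}
    extra = {k: v for k, v in record.items() if k not in schema}
    if not extra:
        return known, schema
    if mode in ("addNewColumns", "addNewColumnsWithTypeWidening"):
        # The stream stops; the evolved schema is picked up after a restart.
        err = UnknownFieldError(sorted(extra))
        err.evolved_schema = schema | set(extra)
        raise err
    if mode == "failOnNewColumns":
        # The stream stops and the schema waits for a manual update.
        raise UnknownFieldError(sorted(extra))
    if mode == "rescue":
        # Schema stays fixed; new fields are kept as a JSON blob.
        return {**known, "_rescued_data": json.dumps(extra)}, schema
    if mode == "none":
        # Schema stays fixed; new fields are dropped.
        return known, schema
    raise ValueError(f"unknown mode: {mode}")


schema = {"user_id", "event"}
record = {"user_id": 1, "event": "click", "referrer": "ad"}
print(apply_mode(schema, record, "rescue")[0])
print(apply_mode(schema, record, "none")[0])
```

With rescue the referrer value survives inside _rescued_data; with none it is gone, which is why that mode is so rarely the right choice.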

Override inferred types with cloudFiles.schemaHints using SQL DDL syntax:

.option("cloudFiles.schemaHints",
        "user_id BIGINT, tags MAP<STRING,STRING>, created_at TIMESTAMP")

Hints override inference, and they hold even when a producer later sends a different type: a user_id hinted as BIGINT stays BIGINT even if the producer sends it as a string, with the original string value falling into _rescued_data. That’s the right behavior for hardening downstream tables against producer mistakes.

Partition columns inferred from Hive-style paths (s3://.../event=click/date=2026-06-11/) are extracted automatically. If your paths use a non-standard layout, set cloudFiles.partitionColumns to a comma-separated list of the column names you want pulled out.

The schemas are versioned. Auto Loader writes a new file under schemaLocation/_schemas/ every time the schema evolves, and the stream’s checkpoint records which version it’s currently on. Resetting evolution back to a clean state requires deleting both the schema location and the checkpoint — leaving either behind produces confusing partial-state errors on restart.

File discovery

How Auto Loader finds new files is the single biggest production tuning lever. There are three modes, and the recommendation in 2026 is different from what was right two years ago.

Directory listing mode (the default when you don’t configure anything else) walks the source prefix and compares the file list against the metadata store in RocksDB. It’s simple — no IAM roles, no cloud resources — and scales linearly with the number of files in the prefix. For a few thousand files per micro-batch, fine. For a prefix with millions of objects partitioned across years of data, every micro-batch starts to spend more time listing than processing.

File notification mode (cloudFiles.useNotifications = true) shifts discovery from polling to push. Auto Loader provisions the cloud-native notification resources on first run: SNS topic + SQS queue on AWS, Event Grid subscription + Storage Queue on Azure, Pub/Sub on GCP. New objects emit notifications, Auto Loader reads them off the queue, and the listing cost disappears. The tradeoffs: you need IAM permissions to create those resources, and you hit a per-bucket concurrency cap (100 concurrent pipelines per S3 bucket, 500 per Azure storage account, 100 per GCS bucket).

File events mode (cloudFiles.useManagedFileEvents = true) is the 2025 GA mode and what Databricks now recommends for everything new. It sits on top of Unity Catalog external locations: Databricks owns the notification resources, queue lifecycle, and concurrency management, so the per-bucket cap effectively goes away and the customer-side IAM setup vanishes. Setup is one flag on the stream plus file events enabled on the Unity Catalog external location that covers the source path. Requirements are a Unity Catalog-enabled workspace and DBR 14.3 LTS or above.

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useManagedFileEvents", "true")
  .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/events")
  .load("s3://acme-events/raw/"))

One historical option, cloudFiles.useIncrementalListing, is the one to leave behind. It tried to speed up directory listing by remembering the last-seen file and only listing files lexicographically after it, which works only when files arrive in sortable order. Databricks deprecated it and flipped its default from auto to false in DBR 17.3 LTS because non-lexicographic file orderings (the common case with UUID-prefixed filenames, or any cloud-side rename) cause files to be silently skipped. If your existing pipeline relies on the auto default, set it explicitly during the upgrade window and migrate to file events at the next opportunity.
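The skipped-file failure is easy to reproduce in a toy model. The snippet below is a simplification of the idea (remember the greatest name seen, list only names that sort after it), not Auto Loader's actual listing code, and the file names are made up.

```python
def incremental_list(all_files: list, last_seen: str) -> list:
    """Toy incremental listing: return only names that sort after last_seen."""
    return sorted(f for f in all_files if f > last_seen)


# Run 1 sees two files; the checkpoint remembers the greatest name.
run1 = ["9f3c-events.json", "c41a-events.json"]
last_seen = max(run1)

# Two new files arrive; one has a UUID-style prefix that sorts before last_seen.
bucket = run1 + ["1b7e-events.json", "e002-events.json"]
discovered = incremental_list(bucket, last_seen)
missed = sorted(set(bucket) - set(run1) - set(discovered))

print(discovered)  # ['e002-events.json']
print(missed)      # ['1b7e-events.json']
```

Nothing errors here: the file that sorts early is simply never listed again, which is the silent data loss described above.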

The rule of thumb for sizing: if you ingest under ~50,000 new files per day per prefix, directory listing is fine and saves you the IAM setup. Above that, switch to file events. Above a few hundred thousand files per day, the classic notification mode’s queue throughput becomes a bottleneck and file events’ managed scaling matters.
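If you manage many streams, the rule of thumb can be written down as a small helper so the choice is made consistently. This is a sketch of our heuristic only: the function name discovery_options is hypothetical, and the 50,000 threshold is the rough figure from the paragraph above, not a Databricks limit.

```python
def discovery_options(new_files_per_day: int) -> dict:
    """Pick cloudFiles discovery options from an estimated daily file count."""
    if new_files_per_day < 50_000:
        # Directory listing is the default, so no extra options are needed.
        return {}
    # Above the threshold, opt in to managed file events.
    return {"cloudFiles.useManagedFileEvents": "true"}


print(discovery_options(10_000))   # {}
print(discovery_options(200_000))  # {'cloudFiles.useManagedFileEvents': 'true'}
```

The returned dictionary can be applied to a reader with one .option(key, value) call per entry.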

Triggered vs continuous

Auto Loader runs as a Structured Streaming query, which means it has the usual trigger() options. The two that matter in practice are availableNow=True (process everything in the source and stop, suitable for orchestrated batch runs) and continuous micro-batches (the default, run until killed, suitable for low-latency pipelines).

Trigger.AvailableNow is underused. For most bronze-layer ingest where the SLA is “data should be available in the next hour or two”, running Auto Loader on a Lakeflow Jobs hourly schedule with availableNow=True produces the same Delta table at a fraction of the cost of a continuously-running stream — the compute spins up, processes the backlog in one or more micro-batches respecting rate limits, and shuts down. Idempotency is guaranteed by the checkpoint; if a run is killed mid-flight, the next run picks up exactly where it left off without re-processing.
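A back-of-the-envelope calculation shows where the saving comes from. The numbers below are hypothetical, and the model assumes the same cluster size and per-minute price in both setups; it only compares compute minutes per day.

```python
def cost_ratio(run_minutes: float, startup_minutes: float = 0.0,
               runs_per_day: int = 24) -> float:
    """Continuous compute minutes per day divided by scheduled minutes per day."""
    scheduled = runs_per_day * (run_minutes + startup_minutes)
    return (24 * 60) / scheduled


# Hypothetical hourly job: 8 minutes of ingest plus 4 minutes of cluster startup.
print(cost_ratio(run_minutes=8, startup_minutes=4))  # 5.0
```

The ratio shrinks quickly as each run gets longer, so the saving depends on how small the hourly backlog really is.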

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/orders")
  .option("cloudFiles.maxFilesPerTrigger", "5000")
  .option("cloudFiles.maxBytesPerTrigger", "10g")
  .load("s3://acme-orders/raw/")
  .writeStream
  .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/orders")
  .trigger(availableNow=True)
  .toTable("main.bronze.orders"))

The two rate limits in this snippet are worth knowing precisely. cloudFiles.maxFilesPerTrigger is a hard limit with a default of 1,000 — Auto Loader will never pull more than this many files in a single micro-batch. cloudFiles.maxBytesPerTrigger is a soft limit — Auto Loader stops adding files to a micro-batch once the byte threshold is crossed, but it doesn’t split a file mid-stream, so actual bytes processed can exceed the threshold by one file’s worth. When both are set, whichever limit is hit first ends the micro-batch.
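The interaction of the two limits can be sketched with a toy planner. This models the rule as described above (hard cap on file count, soft cap on bytes) and is not Auto Loader's actual scheduler; plan_batches and the sizes are illustrative.

```python
def plan_batches(file_sizes: list, max_files: int, max_bytes: int) -> list:
    """Group file sizes into micro-batches: hard file cap, soft byte cap."""
    batches, current, current_bytes = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_bytes += size
        # Hard cap: a batch never holds more than max_files files.
        # Soft cap: the batch closes once the byte threshold is reached or
        # crossed, so the last file may overshoot it.
        if len(current) == max_files or current_bytes >= max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        batches.append(current)
    return batches


# Small files: the file cap ends each batch first.
print(plan_batches([1] * 5, max_files=2, max_bytes=100))        # [[1, 1], [1, 1], [1]]
# Large files: the byte cap ends the batch, overshooting the threshold.
print(plan_batches([60, 60, 60], max_files=10, max_bytes=100))  # [[60, 60], [60]]
```

In the second call the first batch carries 120 bytes against a threshold of 100, which is the "one file's worth" of overshoot.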

Tuning these matters when files are very small or very large. With small files (a few KB), the default maxFilesPerTrigger=1000 produces a tiny micro-batch even on a beefy cluster; bump it to 50,000 or higher so each micro-batch carries enough data to keep the cluster busy. With multi-GB files, a micro-batch never reaches the file limit, so the byte limit is what prevents OOMs on the executors.

The third lever is cloudFiles.backfillInterval. In notification or file-events mode, the queue is the source of truth for new files. If notifications are dropped — and they occasionally are, at the cloud-provider level — files can be missed. cloudFiles.backfillInterval = '1 day' asynchronously triggers a full directory list at that cadence and reconciles against the metadata store, ensuring missed notifications get picked up without producing duplicates. Set this on every notification or file-events stream; the cost is one full list per day and the alternative is unbounded silent data loss.
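Conceptually, the backfill is a set difference between a full listing and what the metadata store already knows. The toy below illustrates that reconciliation with made-up file names; it is not the real implementation.

```python
def backfill(listed: set, metadata_store: set) -> set:
    """Return files present in a full listing but absent from the metadata store."""
    return listed - metadata_store


metadata_store = {"a.json", "b.json"}    # discovered via notifications
listed = {"a.json", "b.json", "c.json"}  # full directory list; c.json's event was dropped

missed = backfill(listed, metadata_store)
metadata_store |= missed                 # record the recovered file as processed
print(sorted(missed))                    # ['c.json']

# A second backfill finds nothing new, so no file is ingested twice.
print(sorted(backfill(listed, metadata_store)))  # []
```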

Production patterns

Three patterns show up on every production Auto Loader deployment.

Monitoring via StreamingQueryListener. Auto Loader exposes two metrics on the streaming query progress that are the right thing to alert on: numFilesOutstanding and numBytesOutstanding. Both report the backlog: files seen in the source that haven’t yet been processed. A healthy stream has those numbers fluctuating around zero between micro-batches. A growing backlog is the symptom you want to catch before it becomes an outage. Register a listener on every production stream:

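from pyspark.sql.streaming import StreamingQueryListener

class BacklogListener(StreamingQueryListener):
    # The base class is abstract, so the lifecycle callbacks are defined
    # even when they do nothing.
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        for src in event.progress.sources:
            # Match case-insensitively; the description's casing may differ.
            if "cloudfiles" in src.description.lower():
                outstanding = int(src.metrics.get("numFilesOutstanding", "0"))
                if outstanding > 100_000:
                    # alert into your incident channel
                    pass

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(BacklogListener())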

The listener fires after every micro-batch; combine it with the cluster’s native metrics and you have full visibility without polling.

Dead-letter handling via _rescued_data. Records that don’t match the schema land in _rescued_data as a JSON blob. Running a periodic query over the bronze table looking for non-null _rescued_data values catches the cases that quiet schema modes hide. A standard pattern is a silver-layer table that splits records by _rescued_data IS NULL — the good records go on to the silver model, the rescued records go to a quarantine table with the source file path for replay or upstream debugging. That’s the closest Auto Loader gets to a true dead-letter queue.
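The split itself is simple. Below is a plain-Python sketch over dictionaries rather than a DataFrame, so it runs anywhere. The function name split_rescued and the sample rows are hypothetical, and we assume the rescued JSON carries the source path under a _file_path key. In Spark the same split is a pair of filters on _rescued_data IS NULL and IS NOT NULL.

```python
import json


def split_rescued(rows: list) -> tuple:
    """Split rows into (clean, quarantined) by whether _rescued_data is set."""
    clean, quarantined = [], []
    for row in rows:
        rescued = row.get("_rescued_data")
        if rescued is None:
            clean.append(row)
        else:
            # Assumption: the rescued JSON includes the source file path.
            payload = json.loads(rescued)
            quarantined.append({**row, "_source_file": payload.get("_file_path")})
    return clean, quarantined


rows = [
    {"user_id": 1, "_rescued_data": None},
    {"user_id": None,
     "_rescued_data": '{"userId": "1", "_file_path": "s3://acme-events/raw/a.json"}'},
]
clean, quarantined = split_rescued(rows)
print(len(clean), len(quarantined))    # 1 1
print(quarantined[0]["_source_file"])  # s3://acme-events/raw/a.json
```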

Partition column inference. Auto Loader pulls Hive-style partition keys out of file paths automatically: s3://acme/raw/year=2026/month=06/day=11/event.json produces year, month, day columns on the target table. This is the right way to expose the partitioning of the source to downstream queries, but it’s also the source of an easy mistake — if the prefix path itself contains a = sign (a Unity Catalog-flavored path with volume=foo in it), Auto Loader will try to extract it as a partition column and fail confusingly. Setting cloudFiles.partitionColumns explicitly to the columns you actually want (or to an empty string if you want none) is the workaround.
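The spurious-column problem is easier to see with a toy extractor. This is a simplified model of Hive-style path parsing, not Auto Loader's parser; the env=prod segment and the wanted parameter (standing in for cloudFiles.partitionColumns) are illustrative.

```python
def hive_partitions(path: str, wanted=None) -> dict:
    """Extract key=value path segments; keep only `wanted` keys when given."""
    found = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key and value:
            found[key] = value
    if wanted is None:
        return found
    return {k: v for k, v in found.items() if k in wanted}


path = "s3://acme/raw/env=prod/year=2026/month=06/day=11/event.json"

# Without an explicit list, the environment marker becomes a column too.
print(hive_partitions(path))
# {'env': 'prod', 'year': '2026', 'month': '06', 'day': '11'}

# With an explicit list, only the real partition keys survive.
print(hive_partitions(path, wanted=["year", "month", "day"]))
# {'year': '2026', 'month': '06', 'day': '11'}
```

Passing an empty list mirrors the empty-string setting: no partition columns are extracted at all.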

Auto Loader inside Lakeflow Declarative Pipelines

Lakeflow Declarative Pipelines (the new name for Delta Live Tables) is the highest-leverage place to run Auto Loader in 2026. Inside a pipeline, a streaming table backed by cloudFiles looks like this in SQL:

CREATE OR REFRESH STREAMING TABLE bronze_events AS
SELECT *, _metadata.file_path AS _source_file
FROM STREAM read_files(
  's3://acme-events/raw/',
  format => 'json',
  schemaHints => 'user_id BIGINT, created_at TIMESTAMP'
);

The read_files table-valued function is the SQL surface for Auto Loader, and inside a Lakeflow pipeline it implicitly uses the file-events mode, manages the schema location and checkpoint under the pipeline’s storage root, and handles the UnknownFieldException restart loop transparently when a new column appears. You don’t pass cloudFiles.schemaLocation or checkpointLocation; the pipeline owns them. That’s the main behavioral difference from a hand-rolled Structured Streaming job.

The Python equivalent uses the cloudFiles source directly, wrapped in @dlt.table:

import dlt

@dlt.table
def bronze_events():
    return (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.inferColumnTypes", "true")
      .load("s3://acme-events/raw/"))

For a more end-to-end walk-through of how streaming tables and materialized views compose in a Lakeflow pipeline — including expectations, change data capture via APPLY CHANGES INTO, and triggered vs continuous compute — see our Databricks Lakeflow Declarative Pipelines (Delta Live Tables) in Practice post.

The single thing to remember when migrating from a standalone Auto Loader job to a Lakeflow streaming table: schema location and checkpoint move under the pipeline’s storage. The reset story changes — you do FULL REFRESH on the table from the Lakeflow UI rather than deleting paths by hand. Don’t try to mix-and-match: pointing a pipeline streaming table at a manually-managed cloudFiles.schemaLocation outside the pipeline’s storage works the first time and breaks on the second deploy.

COPY INTO comparison

COPY INTO is the right tool for a different shape of problem. It’s a SQL command that ingests a fixed list of files (or a pattern that matches a fixed set) in a single transaction. It tracks ingested files in the target table’s metadata, so re-running the command against the same prefix is idempotent: already-ingested files are skipped. It supports schema inference but not evolution, so a COPY INTO against a directory whose files have drifted in schema either rejects the bad records or fails the command, depending on the mode.
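For reference, the basic shape of the statement is short. The helper below just assembles the SQL text so it can be printed or passed to spark.sql; the function name copy_into_sql, the table name, and the source path are hypothetical, and the target table is assumed to exist already.

```python
def copy_into_sql(table: str, source: str, file_format: str) -> str:
    """Build a minimal COPY INTO statement for an existing Delta table."""
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source}'\n"
        f"FILEFORMAT = {file_format.upper()}"
    )


print(copy_into_sql("main.bronze.events_backfill",
                    "s3://acme-events/archive/2024/", "json"))
```

Re-running the generated statement against the same prefix should skip files that were already loaded, which is the idempotency described above.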

Choose COPY INTO when:

The ingest is a one-shot or rare-batch backfill where Structured Streaming infrastructure is overkill.
The target schema is fixed and governed by a contract with upstream producers, so a new column should be a deliberate change, not auto-evolved.
You want the operation to be a single SQL statement runnable from a notebook or BI tool.

Choose Auto Loader when:

Files arrive continuously or on any recurring schedule beyond a few times per day.
Schema evolution is expected and the table is incrementally built.
The source has more than a few thousand files and listing cost matters.

There’s no equivalent to cloudFiles.useManagedFileEvents for COPY INTO — it always lists the source — so anything streaming-shaped should be on Auto Loader, ideally via Lakeflow Declarative Pipelines.

Common pitfalls

A few recurring failure modes worth pre-empting:

Schema location and checkpoint at the same path. Old Databricks blog posts and tutorials put cloudFiles.schemaLocation and checkpointLocation at the same URI. It works, but it makes resetting the schema state require destroying the checkpoint, which forces a full re-process of every file. Put them at separate paths under a common parent, both in a Unity Catalog volume.

addNewColumns restart loop. When schemaEvolutionMode = addNewColumns, the first record with a new column throws UnknownFieldException and the stream needs to be restarted. Outside Lakeflow Declarative Pipelines, that restart is the caller’s responsibility — a bare spark.readStream won’t restart itself. Wrap the stream in a Lakeflow Jobs task with retries, or run it inside Declarative Pipelines, or use addNewColumnsWithTypeWidening with explicit error handling. The “stream stops at the first new column” failure mode is the most-reported issue we see.
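If you do own the restart yourself, the wrapper can be small. The sketch below is one possible shape, not a Databricks API: run_with_restarts is a hypothetical helper, start_stream stands for whatever function builds your stream and blocks until it stops, and we assume a schema-change failure can be recognised by the exception name appearing in the error message.

```python
def run_with_restarts(start_stream, max_restarts: int = 3) -> int:
    """Call start_stream(); restart when it fails on a schema change.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            start_stream()  # blocks until the stream stops or fails
            return restarts
        except Exception as exc:
            # Assumption: schema changes are identifiable from the message text.
            is_schema_change = "UnknownFieldException" in str(exc)
            if not is_schema_change or restarts >= max_restarts:
                raise
            restarts += 1


# Stand-in for a real stream: fails twice on a new column, then succeeds.
attempts = []

def fake_stream():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("UnknownFieldException: new column detected")

print(run_with_restarts(fake_stream))  # 2
```

Capping the number of restarts keeps a genuinely broken stream from looping forever; any other error is re-raised immediately.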

Notification mode IAM drift. Auto Loader provisions SNS/SQS, Event Grid + Storage Queue, or Pub/Sub on first run, and stores the resource references in the checkpoint. If the IAM role the workspace runs under loses one of those permissions later, the stream silently can’t read the queue and the backlog grows without errors. Set up numFilesOutstanding alerting on every notification-mode stream and audit IAM with the databricks-auto-ingest prefix.

Forgetting backfillInterval on notification streams. Notification delivery has cloud-provider-level reliability — close to but not exactly 100%. Without cloudFiles.backfillInterval, missed notifications mean permanently-missed files. Set it to '1 day' on every notification or file-events stream as a matter of policy.

Mixing cloudFiles.useIncrementalListing = true and notification mode. The two options are mutually exclusive but the error is only thrown when the stream tries to run, not at configuration time. The old auto default for incremental listing made this a foot-gun before DBR 17.3 LTS; on newer runtimes the default of false makes it harder to trip into.

Treating _rescued_data as exception data. The rescued data column is the catch-all for anything that doesn’t match the current schema, including casing mismatches and missing fields, not just structurally bad records. Many teams discover they have had valid data flowing into _rescued_data for months because a producer started sending user_id instead of userId and the schema evolution mode was set to rescue. Have a non-null _rescued_data count alert wired to a silver-layer job and a Slack channel.

Partition columns colliding with cloudFiles.partitionColumns. If the source prefix path contains a key=value segment that isn’t a real partition (a Unity Catalog volume name, an environment marker, a region tag), Auto Loader extracts it as a column and the target table grows a spurious column on first run. Set cloudFiles.partitionColumns explicitly — to either the columns you want or an empty string — on every stream where the prefix path isn’t strictly Hive-partitioned.

FAQ

What’s the difference between Auto Loader and COPY INTO?

Auto Loader (cloudFiles) is a Structured Streaming source built for continuous or recurring incremental ingestion with native schema inference and evolution. COPY INTO is a SQL command for one-shot batch loads against a fixed file list, with idempotent re-runs and no schema evolution. Use Auto Loader for any pipeline where new files arrive on a recurring schedule; use COPY INTO for backfills and ad-hoc loads.

What does cloudFiles.useManagedFileEvents do?

It’s the 2025-GA “managed” notification mode for Auto Loader. Databricks provisions and operates the notification resources (SNS/SQS, Event Grid + Storage Queue, Pub/Sub) on top of Unity Catalog external locations, removing the IAM-setup and per-bucket concurrency limits of the classic cloudFiles.useNotifications mode. It’s now the recommended discovery mode for new pipelines on DBR 14.3 LTS and above.

Is cloudFiles.useIncrementalListing deprecated?

Yes. Its default flipped from auto to false in DBR 17.3 LTS because non-lexicographic file orderings (UUID-prefixed names, cloud-side renames) silently skipped files. Databricks recommends migrating to file events (cloudFiles.useManagedFileEvents) or, on older runtimes, classic file notification mode.

How does Auto Loader schema evolution work?

Auto Loader writes the inferred schema to cloudFiles.schemaLocation and versions it. The behavior on new fields depends on cloudFiles.schemaEvolutionMode: addNewColumns (default) throws UnknownFieldException and restarts with the new column added, addNewColumnsWithTypeWidening does the same plus int→long / float→double widening, rescue keeps the schema fixed and lands new fields in _rescued_data, failOnNewColumns halts permanently for manual update, and none silently drops new fields.

What is the _rescued_data column?

A string column holding a JSON blob that Auto Loader adds to every target table. It captures values which couldn’t be parsed into the current schema (missing columns, type mismatches, casing mismatches) alongside the source file path. It’s the safety net that prevents Auto Loader from failing on a bad record. Treat non-null values in it as something to investigate, not as expected.

When should I use Trigger.AvailableNow?

Whenever the SLA for the bronze table is loose enough to tolerate hourly or daily latency. Trigger.AvailableNow processes the current backlog and shuts down, so the cluster only runs during ingest. For a steady ingest at a few files per minute, an hourly availableNow=True schedule is often 5-10x cheaper than a continuously-running stream and produces an identical target table. Reserve continuous triggers for genuinely low-latency requirements.

How does Auto Loader handle exactly-once semantics?

File-level metadata (which files have been processed) is persisted in a RocksDB key-value store inside the stream’s checkpointLocation. On restart, Auto Loader reads the metadata before listing or polling the source, so already-processed files are not re-processed and in-flight files are reconciled against the checkpoint. As long as the checkpoint location is intact, exactly-once at the file level is guaranteed end-to-end.

What’s cloudFiles.backfillInterval for?

A safety net against missed notifications. In notification or file-events mode, the queue can occasionally drop a message at the cloud-provider layer. cloudFiles.backfillInterval = '1 day' triggers an asynchronous full directory list at that cadence to reconcile against the metadata store, picking up any missed files without producing duplicates. Set it on every notification or file-events stream as a matter of policy.

Does Auto Loader work inside Lakeflow Declarative Pipelines?

Yes — it’s the recommended way to use it in 2026. The SQL surface is the read_files table-valued function or STREAM read_files(...) for streaming tables; the Python surface is the same cloudFiles API wrapped in @dlt.table. The pipeline implicitly manages schemaLocation and checkpointLocation under the pipeline’s storage root, uses file events by default, and handles UnknownFieldException restarts automatically. Don’t try to share a checkpoint between a pipeline-managed and a hand-rolled cloudFiles stream.

How do I reset Auto Loader’s schema or replay all files?

Delete both the schemaLocation and the checkpointLocation, then re-run the stream. Leaving either path behind produces partial-state errors. Inside Lakeflow Declarative Pipelines, the equivalent operation is FULL REFRESH on the table, which is the supported way to wipe state without touching the underlying paths.

Wrapping up

Auto Loader is the right answer for almost any “new files arriving in cloud storage, land them in Delta” pipeline on Databricks. The 2025 changes — file events going GA under cloudFiles.useManagedFileEvents, the useIncrementalListing deprecation, and the Lakeflow Connect rebrand bringing Auto Loader and the new managed connectors under one umbrella — are mostly defaults-and-recommendations changes rather than behavioral ones, but they meaningfully simplify what a new pipeline looks like compared to what teams shipped a couple years back.

If you’re inspecting Auto Loader output tables, debugging schema-evolution failures, or building internal tools on top of bronze-layer Delta tables, QueryPlane is a SQL editor and app builder that connects to your Databricks SQL warehouse and lets you build interactive apps over those tables — including views with the _rescued_data quarantine pattern wired into a dashboard for your data engineers. If you’re shopping around for a Databricks GUI generally, our roundup of the best Databricks GUI tools covers the alternatives. If Auto Loader is the front end to a bigger declarative pipeline, the Lakeflow Declarative Pipelines in Practice post picks up where this one leaves off. And for the storage layout that bronze tables should use in 2026, our Databricks Liquid Clustering in practice post covers CLUSTER BY, OPTIMIZE, and Predictive Optimization.