Snowflake Stages, COPY INTO, and Snowpipe in Practice

Q: What's the difference between a Snowflake stage and a file format?

A stage is where the files live — a table stage, user stage, internal stage, or external stage pointing at S3/GCS/Azure. A file format is how Snowflake parses them — CSV with these delimiters, JSON with this null handling, Parquet with the vectorized reader. A stage can have a default file format attached, but you can override it at COPY INTO time, and the same file format can be reused across many stages and pipes.

Q: What does COPY INTO do that INSERT doesn't?

COPY INTO reads files from a stage, parses them with a file format, and writes the rows in bulk. INSERT writes literal row values from a SQL statement. For loading large volumes, COPY INTO is dramatically faster because it parallelizes across files and uses Snowflake's bulk-load path; INSERT is fine for handfuls of rows but does not scale. There is also INSERT INTO ... SELECT ... FROM @stage, which can be useful for small transformations but uses the bulk path internally and is functionally close to COPY INTO with a SELECT.

Q: Why does COPY INTO skip files I want it to load?

Snowflake's LOAD_HISTORY tracks loaded files for 64 days per table and skips files whose name and size match a previous load. If you renamed a file or its byte size is identical to a previously-loaded file, COPY INTO will skip it silently. The fastest diagnostic is SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(...)) for the affected table — it shows skipped files explicitly. Force a reload with FORCE = TRUE if you actually want to reload.

Q: What's the right warehouse size for COPY INTO?

Usually X-Small. COPY INTO parallelizes by file, so a warehouse with N workers can process at most N files concurrently. For loads of a few files at a time, larger warehouses pay for cores that have nothing to do. The exception is one very large file that decompresses and parses serially per worker; if you have a 50 GB Parquet file as a single load unit, a larger warehouse helps. Otherwise: X-Small, measure, only move up if EXECUTION_TIME is genuinely limiting.

How to load data into Snowflake reliably: named vs external stages, COPY INTO semantics, Snowpipe auto-ingest, Snowpipe Streaming, and the idempotency patterns that survive replays.

This post was written by an engineer at QueryPlane. QueryPlane is an app builder for your database: bring your own postgres db and you can create interactive applications to share with other developers, coworkers or even your customers. If you’re interested in trying it out, get started here.

Almost every Snowflake account starts with a COPY INTO from a CSV in S3 and ends with a sprawling collection of stages, file formats, pipes, and Streaming ingestion jobs whose semantics nobody on the team remembers in full. Loading is the part of the data warehouse that touches everything — the orchestrator, the bucket policy, the file format the upstream system happens to emit, the schema-evolution policy nobody has actually written down — so the cost of getting it slightly wrong compounds across every downstream pipeline.

This post walks through the loading surface Snowflake exposes: where data actually lives before it is loaded (stages), the command that loads it (COPY INTO), the continuous-loading service (Snowpipe), and the low-latency streaming API (Snowpipe Streaming). It also covers the patterns that keep a load pipeline idempotent across replays, the file-sizing choices that actually move ingestion cost, and the failure modes that bite teams six months in.

In this post, we’ll cover:

The four stage types — table, user, named internal, named external — and when each is the right answer
File formats — CSV, JSON, Parquet, Avro, and the parameters that matter
COPY INTO mechanics — ON_ERROR, FORCE, PURGE, VALIDATION_MODE, and what each does to replays
Snowpipe auto-ingest — event notifications, the COPY_HISTORY view, and 14-day file-tracking semantics
Snowpipe Streaming — the row-level API that bypasses files entirely
Idempotency patterns — load-key columns, MERGE deduplication, and replay-safe orchestration
Cost levers — file sizing, warehouse sizing for COPY INTO, and the per-file overhead in Snowpipe

Stages: where files live before they load

A stage is the abstraction Snowflake uses to talk about a location that holds files for loading or unloading. The stage type determines who owns the bucket and how the bytes get there; the loading semantics are the same once Snowflake can see the files.

There are four stage types worth knowing.

Table stages are implicit — every table gets one named @%table_name. They are scoped to a single table, owned by Snowflake, and have no separate URL or credential. They’re useful for one-off loads against a specific table without provisioning anything else.

User stages are also implicit — every user gets one named @~. They’re scoped to the user, useful for ad-hoc work in worksheets, and largely unused in production pipelines because they don’t share well across an organization.

Named internal stages are explicitly created stage objects backed by Snowflake-managed storage. They live inside a schema, have grants like any other object, and are the right answer when you want a shared staging area without standing up your own S3 bucket. Files in an internal stage are encrypted at rest with Snowflake’s keys and you pay for the storage as part of the account’s storage bill.

CREATE STAGE raw_events.staging_internal
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
  COMMENT = 'Internal staging for raw event uploads';

Named external stages point at a bucket you own — S3, GCS, or Azure Blob — and are by far the most common pattern in production. The data is already in your storage account, written there by an upstream job, and the stage is the named handle Snowflake uses to read it. Authentication is either via a storage integration (recommended) or inline credentials (don’t).

CREATE STORAGE INTEGRATION s3_raw_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-load'
  STORAGE_ALLOWED_LOCATIONS = ('s3://acme-raw-events/');

CREATE STAGE raw_events.staging_s3
  URL = 's3://acme-raw-events/events/'
  STORAGE_INTEGRATION = s3_raw_integration
  FILE_FORMAT = (FORMAT_NAME = raw_events.csv_with_header);

Storage integrations are worth the extra setup because they let you grant Snowflake’s IAM role access to a specific bucket without ever pasting an access key into a stage definition. Rotating the underlying AWS role doesn’t require touching the stage. And the bucket policy on the AWS side is the actual source of truth for who can read what, which is where most security audits expect it to be.

A useful diagnostic before you load anything: LIST @stage_name returns every file Snowflake can see at that stage, with sizes and timestamps. If LIST returns zero rows but you know the files are in the bucket, you almost certainly have a credentials, prefix, or bucket-policy problem, not a Snowflake problem.

File formats define how Snowflake parses the bytes

A file format is a named bundle of parsing rules — type (CSV / JSON / PARQUET / AVRO / ORC / XML), delimiter, quoting, compression, null handling — that you reference from a stage or a COPY INTO. Defining them as named objects rather than inlining every parameter into every COPY INTO is the difference between a pipeline you can maintain and one that drifts.

CREATE FILE FORMAT raw_events.csv_with_header
  TYPE = CSV
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  NULL_IF = ('', 'NULL', 'null')
  EMPTY_FIELD_AS_NULL = TRUE
  COMPRESSION = AUTO
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE;

CREATE FILE FORMAT raw_events.json_lines
  TYPE = JSON
  STRIP_OUTER_ARRAY = FALSE
  STRIP_NULL_VALUES = FALSE
  COMPRESSION = AUTO;

CREATE FILE FORMAT raw_events.parquet
  TYPE = PARQUET
  USE_LOGICAL_TYPE = TRUE
  USE_VECTORIZED_SCANNER = TRUE;

A few parameters earn their keep:

ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE is the difference between a CSV with an extra column silently loading garbage into the wrong fields and a load that fails loudly so you fix the upstream schema.
NULL_IF on CSV controls which string values become SQL NULL. Without it, a column containing the literal string NULL lands as the four characters, not a real null, and every downstream IS NULL filter quietly misses those rows.
USE_VECTORIZED_SCANNER = TRUE on Parquet enables the vectorized reader and is meaningfully faster for wide tables. It’s on by default in newer accounts but worth setting explicitly.
STRIP_OUTER_ARRAY on JSON matters when the source emits a single JSON array of rows versus newline-delimited JSON (NDJSON). The wrong choice loads one row containing an array instead of N rows.

For Parquet and Avro specifically, the MATCH_BY_COLUMN_NAME option on COPY INTO makes Snowflake map columns by name rather than position, which is the only thing that keeps schema evolution from breaking every load when the source adds a new column in the middle.

`COPY INTO` is the load command

COPY INTO is the workhorse. It reads files from a stage, parses them according to a file format, and writes the rows into a target table. The straightforward case looks like this:

COPY INTO raw_events.event_raw
FROM @raw_events.staging_s3
FILE_FORMAT = (FORMAT_NAME = raw_events.csv_with_header)
ON_ERROR = ABORT_STATEMENT;

That covers about 60% of real pipelines. The remaining 40% is in the options.

ON_ERROR controls what happens when a row fails to parse or violates a constraint. ABORT_STATEMENT is the default and stops the entire load on the first failure — the right choice when every row matters and you’d rather fix the upstream than load partial data. CONTINUE skips the bad rows and loads the rest. SKIP_FILE skips the whole file. SKIP_FILE_<n> or SKIP_FILE_<n>% skips files that exceed an absolute or percentage error threshold. For backfills from messy historical data, SKIP_FILE_5% is a reasonable compromise; for real-time pipelines, ABORT_STATEMENT is what you want.

FORCE = TRUE tells COPY INTO to load files it has already loaded. Snowflake tracks loaded files in LOAD_HISTORY for 64 days per table and by default skips files whose name and size match a previous successful load. FORCE = TRUE bypasses that check and reloads everything — useful for replaying after a botched transform, dangerous in steady state because every replay duplicates rows.

PURGE = TRUE deletes files from the stage after a successful load. It’s how teams keep an internal stage from accumulating forever, but it’s a one-way operation — once Snowflake deletes the file, your replay options shrink to whatever’s left in LOAD_HISTORY. For external stages, lifecycle policies on the bucket are usually cleaner than PURGE.

VALIDATION_MODE runs the load in dry-run mode. RETURN_ERRORS returns every row that would fail without loading anything; RETURN_<n>_ROWS returns the first N rows that would load successfully. This is the right way to debug a new file format before pointing it at production data.

-- See what would fail before actually loading
COPY INTO raw_events.event_raw
FROM @raw_events.staging_s3
FILE_FORMAT = (FORMAT_NAME = raw_events.csv_with_header)
VALIDATION_MODE = RETURN_ERRORS;

-- See the first 100 rows that would load
COPY INTO raw_events.event_raw
FROM @raw_events.staging_s3
FILE_FORMAT = (FORMAT_NAME = raw_events.csv_with_header)
VALIDATION_MODE = RETURN_100_ROWS;

A pattern worth stealing: every COPY INTO in production should specify the file list explicitly (via FILES = ('a.csv', 'b.csv') or PATTERN = '.*[.]csv') rather than loading “everything in the stage.” Explicit lists are auditable. The pattern variant survives a midnight add of unrelated files to the bucket without silently picking them up.

Transforming during the load

COPY INTO accepts a SELECT instead of the bare stage, which lets you reshape the data on the way in — pick specific columns, cast types, parse a variant column, drop or rename fields, derive load timestamps. This is the right place to put light transformations because it avoids landing the data twice.

COPY INTO raw_events.event_clean (event_id, user_id, event_type, occurred_at, load_ts)
FROM (
  SELECT
    $1::STRING                         AS event_id,
    $2::NUMBER                         AS user_id,
    $3::STRING                         AS event_type,
    TO_TIMESTAMP($4, 'YYYY-MM-DD HH24:MI:SS') AS occurred_at,
    CURRENT_TIMESTAMP()                AS load_ts
  FROM @raw_events.staging_s3
)
FILE_FORMAT = (FORMAT_NAME = raw_events.csv_with_header)
PATTERN = 'events/2026/05/.*[.]csv'
ON_ERROR = ABORT_STATEMENT;

Three things in that snippet matter. $1, $2, $3, $4 reference positional columns from the file — when CSV column order matches the table, you can omit the column list entirely, but pinning it makes the load robust to reordering in the source. TO_TIMESTAMP with an explicit format string is the right way to handle non-default timestamp encodings rather than relying on Snowflake’s auto-detection. And the load timestamp gives you a column to filter on for incremental downstream jobs without trusting whatever timestamp the source happens to emit.

For semi-structured JSON, the same pattern landing into a VARIANT is the idiomatic load:

COPY INTO raw_events.event_json (raw_payload, load_ts, source_file)
FROM (
  SELECT
    $1                                 AS raw_payload,
    CURRENT_TIMESTAMP()                AS load_ts,
    METADATA$FILENAME                  AS source_file
  FROM @raw_events.staging_json
)
FILE_FORMAT = (FORMAT_NAME = raw_events.json_lines);

METADATA$FILENAME is the file the row came from, and METADATA$FILE_ROW_NUMBER is its position inside that file. Storing both turns a VARIANT table into a debuggable artifact — when a downstream parse fails, you can trace it back to a specific byte range of a specific file.

See what QueryPlane can build for you

Connect to your database, write SQL with AI, and build shareable apps — all from your browser.

Get Started Book a Demo

Snowpipe loads continuously

COPY INTO is fine when the orchestrator runs it on a schedule. When files arrive sporadically and you want them loaded within seconds without paying for a warehouse to sit warm, you want Snowpipe.

A pipe is a named object that wraps a COPY INTO statement. When the stage receives a file, an event notification (S3 SNS, GCS Pub/Sub, Azure Event Grid, or an explicit REST call) fires; Snowpipe runs the underlying COPY INTO against the new file using a serverless compute pool that Snowflake manages. You are billed per file by data volume and per-file processing cost, not by warehouse-second.

CREATE PIPE raw_events.event_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO raw_events.event_raw
  FROM @raw_events.staging_s3
  FILE_FORMAT = (FORMAT_NAME = raw_events.csv_with_header);

For S3 auto-ingest, after creating the pipe you call SHOW PIPES to get the pipe’s notification ARN and attach an SNS or SQS notification on the bucket to it. The bucket-side configuration is the thing teams forget — without the notification subscription, the pipe exists but never fires, and files pile up in the stage with no errors anywhere.

The right diagnostics for pipes live in two places. COPY_HISTORY shows every load attempt, the row counts, and the error breakdown. PIPE_USAGE_HISTORY shows credit usage and ingestion rate. The query you’ll run more than any other:

SELECT
  pipe_name,
  DATE_TRUNC('hour', last_load_time) AS hour,
  COUNT(*)                           AS files_loaded,
  SUM(row_count)                     AS rows_loaded,
  SUM(error_count)                   AS error_rows,
  SUM(file_size) / 1024 / 1024       AS mb_loaded
FROM snowflake.account_usage.copy_history
WHERE pipe_name IS NOT NULL
  AND last_load_time >= DATEADD(day, -1, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 2 DESC;

There is one Snowpipe semantics rule that catches every team eventually: Snowpipe deduplicates by filename for 14 days. If the upstream writes events.csv once, then overwrites it in place with new content, the second write is silently dropped because Snowpipe already saw a file with that name. The fix is to ensure filenames are unique — typically a timestamp or ULID prefix — rather than trying to coerce Snowpipe into reloading. After 14 days the dedup window expires and a same-named file will load again, which is its own surprise.

File size is the other lever. The Snowpipe documentation recommends compressed files between 100 MB and 250 MB. Smaller files load with proportionally higher per-file overhead, both in latency and credit cost; larger files mean longer load times before the data is queryable. A pipeline that emits 10,000 tiny files an hour will spend more on per-file overhead than on actual ingestion compute. Either batch upstream into larger files or aggregate into a target sweet-spot range in a small landing job.

Snowpipe Streaming bypasses files entirely

The newer Snowpipe Streaming ingestion path is a row-level API that writes directly into Snowflake from a client SDK, without staging files at all. The use case is sub-second latency for high-volume event streams — clickstream, IoT, application telemetry — where the file-based ingestion path is too slow and too expensive at the per-event grain.

The two flavors are worth knowing. The classic Streaming Ingest SDK is a Java client that opens a channel to a table, batches rows in memory, and flushes them on a cadence; throughput is in the hundreds of MB/sec per client with end-to-end latency typically under a second. The newer Snowpipe Streaming with high-performance architecture uses Apache Iceberg under the hood and supports a REST interface, raising the throughput ceiling and supporting languages beyond Java.

A skeleton of the classic SDK usage in Java looks like:

SnowflakeStreamingIngestClient client = SnowflakeStreamingIngestClientFactory
    .builder("CLIENT_1")
    .setProperties(props)
    .build();

OpenChannelRequest request = OpenChannelRequest.builder("CHANNEL_1")
    .setDBName("events")
    .setSchemaName("raw")
    .setTableName("event_stream")
    .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
    .build();

SnowflakeStreamingIngestChannel channel = client.openChannel(request);

Map<String, Object> row = Map.of(
    "event_id", "evt_" + UUID.randomUUID(),
    "user_id", 12345,
    "occurred_at", Instant.now().toString()
);
InsertValidationResponse resp = channel.insertRow(row, "offset_1");

The offset token ("offset_1") is the durability primitive. You assert a per-row offset when you insert, and the SDK exposes the latest committed offset per channel. On client restart you read the committed offset, resume from there, and Snowflake’s at-least-once semantics combined with idempotent offset assignment give you exactly-once delivery as long as your application orders writes deterministically.

Streaming is meaningfully cheaper per row than file-based Snowpipe at high volume because there is no per-file overhead — you pay for bytes ingested plus a small flat per-channel cost. The flip side is that the application has to be running, holding open channels, and managing offset state, which is a different operational shape than “drop a file in a bucket.”

Idempotency: building loads that survive replays

The single most important property of a Snowflake load pipeline is that re-running the same load produces the same result. Production data engineering is full of partial failures — an orchestrator retries a step that already partially succeeded, a bucket gets re-synced, a backfill overlaps a normal run — and a pipeline that double-loads on retries is a pipeline that ships incorrect numbers to dashboards.

Three patterns cover most cases.

Filename uniqueness + LOAD_HISTORY dedup is the default behavior. Snowflake skips files whose (name, size, modification_time) matches a previous successful load on the same table for 64 days. If your upstream emits unique filenames (timestamp prefix, ULID, or instance-id/seq) and you do not pass FORCE = TRUE, replays are safe at the file granularity. This is enough for most batch pipelines.

MERGE deduplication on a load key handles the case where the same logical row arrives across multiple files. If your source emits an event with a stable event_id, the load lands rows into a staging table and a follow-up MERGE upserts into the final table on event_id. This is also the right pattern when the schema involves late-arriving updates — a new row with the same key replaces the old.

MERGE INTO events.event_final t
USING (
  SELECT event_id,
         user_id,
         event_type,
         occurred_at,
         MAX(load_ts) AS load_ts
  FROM events.event_staging
  WHERE load_ts >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
  GROUP BY 1, 2, 3, 4
) s
ON t.event_id = s.event_id
WHEN MATCHED AND s.load_ts > t.load_ts THEN UPDATE SET
  user_id     = s.user_id,
  event_type  = s.event_type,
  occurred_at = s.occurred_at,
  load_ts     = s.load_ts
WHEN NOT MATCHED THEN INSERT (event_id, user_id, event_type, occurred_at, load_ts)
  VALUES (s.event_id, s.user_id, s.event_type, s.occurred_at, s.load_ts);

The s.load_ts > t.load_ts guard on WHEN MATCHED is what makes the MERGE itself idempotent — replaying the same staging window does not overwrite a newer row that arrived in between. Without it, an out-of-order replay can move the final row backwards.

Streams + tasks for incremental processing is the right pattern for downstream transforms. The stream sits on the staging table, the task consumes from it on a schedule, and the stream’s offset advances only when the consuming MERGE commits. Replays are safe because the stream remembers what’s been read. See our guide to Snowflake streams and tasks for the full pattern.

The combination — unique filenames, MERGE on a stable key with a load_ts guard, and streams for downstream propagation — gives you a pipeline that survives any retry policy your orchestrator decides to invent.

Costs: where the credits actually go

There are three meaningful cost lines in a load pipeline.

Warehouse compute for COPY INTO is straightforward — the load runs on whatever warehouse the session uses, billed per second. The right size is usually the smallest warehouse that finishes the load before the next file arrives; for most file-based loads under a few GB, an X-Small is enough. Bigger warehouses parallelize across files (each file goes to a single worker, so beyond N files = N workers, you’re paying for empty cores). Sizing decisions for COPY INTO follow the same logic as the rest of Snowflake — see our warehouse sizing post for the full methodology.

Snowpipe per-file overhead is the one most teams underestimate. Each Snowpipe load has a fixed accounting cost on top of the per-byte cost, which means thousands of tiny files cost dramatically more than the equivalent bytes in a few right-sized files. The fix is to either batch upstream so files arrive at 100-250 MB compressed, or to run a small landing job that combines tiny files into the target range before Snowpipe picks them up.

Streaming bytes + per-channel cost for Snowpipe Streaming. There is no per-file overhead, but you pay a small flat rate per channel per hour and per byte ingested. For high-volume event streams (millions of rows per hour) Streaming wins on cost; for occasional small writes, file-based Snowpipe is usually cheaper.

The single query that surfaces actual loading cost across modes:

SELECT
  DATE_TRUNC('day', start_time)              AS day,
  pipe_name,
  SUM(credits_used)                          AS pipe_credits,
  SUM(bytes_inserted) / 1024 / 1024 / 1024   AS gb_loaded,
  SUM(files_inserted::NUMBER)                AS files_loaded
FROM snowflake.account_usage.pipe_usage_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1 DESC, pipe_credits DESC;

Pair that with the warehouse cost from WAREHOUSE_METERING_HISTORY for non-pipe loads, and you have a clean picture of cost per ingestion mode. The pattern teams usually find: a small number of high-frequency pipes dominate Snowpipe spend, and rebatching their input files is the highest-ROI optimization available.

Common pitfalls

Loading directly into the production table. A failed mid-load COPY INTO can leave the target table with partial state — COPY INTO is atomic per file, not per statement. The safer pattern is to load into a staging table, validate row counts, and MERGE or INSERT into the final table only if validation passes. The cost is one extra hop; the benefit is that a malformed file never touches downstream.

Trusting auto-detected schemas on JSON. JSON’s flexibility is also its trap — a column that has been STRING for six months can quietly become NUMBER for one row because the source emitted an unquoted value, and the auto-detected schema on COPY INTO with MATCH_BY_COLUMN_NAME will start writing nulls. Land semi-structured data into a single VARIANT column and parse it explicitly in the next stage. The VARIANT is the schema-evolution boundary.

Misunderstanding FORCE versus PURGE. FORCE = TRUE reloads files that have already been loaded, which duplicates rows unless your downstream MERGE dedupes. PURGE = TRUE deletes files after loading, which prevents future replays. The combination of FORCE = TRUE without PURGE is the most common cause of accidental duplicates; the combination of PURGE without unique filenames is the most common cause of losing the ability to replay. Default to neither, and add only with deliberate reason.

Tiny files in Snowpipe. A Kafka Connect job emitting one file per partition per minute will produce hundreds of files per minute, each carrying full Snowpipe per-file overhead. The fix is either upstream batching or a staging buffer that combines files before they hit the pipe. Either way, the cost of getting this wrong shows up as a Snowpipe credit line item that doesn’t track the volume of data being loaded.

External stages and bucket lifecycle. If the bucket has a lifecycle policy that expires files before Snowflake’s LOAD_HISTORY does, you cannot replay even if the table-side LOAD_HISTORY would allow it. Worse, if MATCH_BY_COLUMN_NAME is mapping by name on Parquet and an upstream schema change adds or renames columns, the next file might load wildly wrong data unless you explicitly check. Always pair external stages with at least 30 days of object-version retention.

Loading into a clustered table during heavy INSERT traffic. Clustering keys cause Snowflake to reclassify micro-partitions in the background; very frequent loads into a clustered table can produce a lot of reclassification work and inflate Automatic Clustering credits. For high-rate streaming, consider whether the destination table actually needs clustering, or whether a downstream INSERT into a clustered table from an unclustered staging table is cheaper.

Frequently asked questions

What’s the difference between a Snowflake stage and a file format? A stage is where the files live — a table stage, user stage, internal stage, or external stage pointing at S3/GCS/Azure. A file format is how Snowflake parses them — CSV with these delimiters, JSON with this null handling, Parquet with the vectorized reader. A stage can have a default file format attached, but you can override it at COPY INTO time, and the same file format can be reused across many stages and pipes.

When should I use an internal stage vs an external stage? External stages (pointing at your own S3/GCS/Azure bucket) are the default for production pipelines because the upstream job almost always writes to your own storage anyway. Internal stages are useful for ad-hoc loads, for users who can’t easily write to your cloud storage, or as a simple staging area when you don’t want to provision a bucket. Storage cost is the same; access semantics are different — external stages let you use bucket policies and storage integrations, internal stages use Snowflake grants.

What does COPY INTO do that INSERT doesn’t? COPY INTO reads files from a stage, parses them with a file format, and writes the rows in bulk. INSERT writes literal row values from a SQL statement. For loading large volumes, COPY INTO is dramatically faster because it parallelizes across files and uses Snowflake’s bulk-load path; INSERT is fine for handfuls of rows but does not scale. There is also INSERT INTO ... SELECT ... FROM @stage, which can be useful for small transformations but uses the bulk path internally and is functionally close to COPY INTO with a SELECT.

How does Snowpipe know when new files arrive? For auto-ingest pipes, you attach an event notification on the bucket (S3 SNS or SQS, GCS Pub/Sub, Azure Event Grid) that points at the pipe’s notification channel. When a file is created or updated, the bucket fires the event, and Snowpipe runs the pipe’s underlying COPY INTO against the new file. There is also a REST endpoint (insertFiles) for when event notifications aren’t available — your loader explicitly tells the pipe which files to load.

Why is Snowpipe loading the same file twice? It usually isn’t — Snowpipe deduplicates by filename for 14 days. If you’re seeing duplicate rows, the cause is almost always that the source emitted the same content in two differently-named files, or that you ran a COPY INTO ... FORCE = TRUE in addition to the pipe. Check COPY_HISTORY filtered to the pipe’s name and the affected time range; the file_name column shows exactly which files were loaded.

How small a file is too small for Snowpipe? Snowflake recommends compressed files between 100 MB and 250 MB. Files under ~10 MB pay disproportionate per-file overhead in both latency and credit cost. If your upstream emits small files, either batch them in a staging job before the pipe sees them, or switch the high-volume part of the workload to Snowpipe Streaming, which has no per-file cost.

When should I use Snowpipe Streaming instead of Snowpipe? Snowpipe Streaming wins when you have a high-volume event stream (typically tens of thousands of rows per second or more), need sub-second latency, and have a long-running client application that can hold open channels and manage offset state. Snowpipe (file-based) wins when files are already being produced by an upstream system you don’t control, when the volume is low enough that per-file overhead doesn’t dominate, and when you don’t want to operate a streaming client.

How do I make my Snowflake load idempotent? Three things: emit unique filenames upstream so Snowflake’s built-in LOAD_HISTORY dedupes file-level retries, land into a staging table and MERGE into the final table on a stable business key with a load_ts guard so re-running the same window doesn’t overwrite newer rows, and use streams to drive downstream consumption so offset advancement is tied to commit. Avoid FORCE = TRUE outside deliberate replays.

Why does COPY INTO skip files I want it to load? Snowflake’s LOAD_HISTORY tracks loaded files for 64 days per table and skips files whose name and size match a previous load. If you renamed a file or its byte size is identical to a previously-loaded file, COPY INTO will skip it silently. The fastest diagnostic is SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(...)) for the affected table — it shows skipped files explicitly. Force a reload with FORCE = TRUE if you actually want to reload.

What’s the right warehouse size for COPY INTO? Usually X-Small. COPY INTO parallelizes by file, so a warehouse with N workers can process at most N files concurrently. For loads of a few files at a time, larger warehouses pay for cores that have nothing to do. The exception is one very large file that decompresses and parses serially per worker; if you have a 50 GB Parquet file as a single load unit, a larger warehouse helps. Otherwise: X-Small, measure, only move up if EXECUTION_TIME is genuinely limiting.

Wrapping up

Snowflake’s loading surface is wide because real ingestion is wide — internal vs external storage, batch vs continuous, files vs streaming events, transformation during load vs after. The good news is the primitives compose: stages and file formats describe what’s there and how to read it, COPY INTO is the load command, Snowpipe wraps COPY INTO for event-driven ingestion, and Snowpipe Streaming bypasses files entirely for high-rate event streams.

The patterns that survive in production are the unglamorous ones — unique filenames, named file formats, explicit VALIDATION_MODE runs against new sources, MERGE with a load_ts guard for idempotency, file sizes in the 100-250 MB range, and staging tables between the load and the production table. Skip any of those and you eventually hit a replay scenario where the pipeline corrupts a downstream metric, and the cost of the cleanup dwarfs the cost of doing it right the first time.

If you’re spending real time inside Snowflake, our guides to warehouse sizing, the query profile, clustering keys, streams and tasks, dynamic tables, zero-copy clones, and RBAC and secure views cover the other levers that compound with the loading patterns here. And if you want a fast SQL editor for exploring COPY_HISTORY, PIPE_USAGE_HISTORY, and LOAD_HISTORY while you tune a pipeline, the QueryPlane Snowflake integration connects in a few minutes.