Can I still automate ingestion if my source data is inconsistent? - A solution using Snowflake's INFER_SCHEMA
- Digital Hive
- Jun 2
When dealing with real-world data, we often face inconsistencies in the source data structure, especially when those sources are file based, whether the files are structured or unstructured. Snowflake provides a solution for tackling these issues. Here we dive into how you can safely handle source data inconsistencies using the Snowflake tool stack.
Originally intended to infer schemas from semi-structured or structured files during prototyping, Snowflake's INFER_SCHEMA can easily be implemented in production settings.
What is the problem?
Imagine having a steady Snowpipe that has not needed any updates or maintenance since it was created. Structured data files are being dropped on an external stage, the pipe gets triggered, and it ingests the data into the predefined columns in your Snowflake table.
Everything is going smoothly until you get a new request from the marketing team for several new fields to be included. Now you need to update not only the pipe but also your destination table.
What is the best way to tackle the issue?
One approach is to ingest the full file content into a single file-content column (a minimal sketch is shown below). This is easy to implement. However, in a generic medallion architecture, this approach merely defers the issue from the bronze (raw) layer to the silver layer. Happy ingestion team, not so happy transformation team.
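As an illustration of this first approach, here is a minimal sketch, assuming a hypothetical bronze table called raw_file_content and the same stage path used later in this post; with no field delimiter, each full line of a staged file lands in a single VARCHAR column for the silver layer to parse:

-- Hypothetical bronze table that simply stores each raw line of a staged file
CREATE TABLE IF NOT EXISTS raw_file_content (
    file_name    VARCHAR,
    line_content VARCHAR
);

-- COPY with a transformation: FIELD_DELIMITER = NONE keeps every line whole in $1
COPY INTO raw_file_content
FROM (
    SELECT METADATA$FILENAME, t.$1
    FROM @mystage/csv/ t
)
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = NONE);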
A more sustainable approach is to allow your source some flexibility and handle this in the ingestion process. Here Snowflake offers a solution based on automatically detecting schemas and column definitions in structured and semi-structured data files: INFER_SCHEMA.
Requirements
Stage: Your data file must be in a named stage (internal or external) or user stage.
Target Files: The LOCATION argument must point to one or more files. You can provide the full path to a specific file, or point to a directory containing multiple files and use the MAX_FILE_COUNT argument to limit the number of files that will be used to infer the schema (see the sketch after this list).
File formats: INFER_SCHEMA works with the following file formats: Parquet, Avro, ORC, JSON, and CSV. Detection is not extension dependent: .txt files containing CSV or JSON data can also be handled, and the same goes for gzip-compressed files (.gz). You must specify a pre-defined FILE_FORMAT object, even for common types like JSON or CSV; this is how Snowflake knows how to interpret the raw files.
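As a quick illustration (the stage path and file format name are assumptions borrowed from the example further below), you can preview what INFER_SCHEMA detects, and cap the number of sampled files with MAX_FILE_COUNT, before creating any table:

-- Preview the inferred schema using at most 10 of the staged files
SELECT *
FROM TABLE(
    INFER_SCHEMA(
        LOCATION=>'@mystage/csv/',
        FILE_FORMAT=>'my_csv_format',
        MAX_FILE_COUNT=>10
    )
);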
How does it work?
Here we go over a typical case where we want the latest file dropped in an external stage to be ingested into a table in Snowflake. Historical data is not kept in this approach. This case is a good representation of ingesting reference data. We will work out the approach for ingesting CSV files.
Create a file format
This file format will be used to parse the files staged at a given path.
CREATE FILE FORMAT my_csv_format
    TYPE = csv
    PARSE_HEADER = TRUE;
Note: Use PARSE_HEADER = TRUE in your file format to infer actual column names instead of defaulting to c1, c2, etc.
Create table using template
CREATE TABLE mytable
USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(
        INFER_SCHEMA(
            LOCATION=>'@mystage/csv/',
            FILE_FORMAT=>'my_csv_format'
        )
    ));
The INFER_SCHEMA function in combination with the USING TEMPLATE clause is used to auto-generate a table schema from file metadata. INFER_SCHEMA scans staged data (e.g., csv files) and returns metadata including column names, data types, nullability, and more*. This schema array is passed into USING TEMPLATE, which allows the construction of a table based on the schema definition.
Copy data into the created table
COPY INTO mytable
FROM '@mystage/csv/'
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
The COPY INTO command loads the data from @mystage/csv/ into mytable, using the predefined file format.
The clause MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE ensures that the mapping between source and target columns is done by name, ignoring case differences.
Adding PURGE = TRUE automatically deletes the source files after they have been successfully ingested, preventing duplicate loads and saving storage; see the variant below.
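For reference, the same COPY statement with PURGE enabled (the names are reused from the example above; this variant also appears inside the scheduled task below) would look like this:

COPY INTO mytable
FROM '@mystage/csv/'
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
PURGE = TRUE;  -- delete the staged files once they are loaded successfully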
Automate the process
You can wrap the ingestion logic inside a SQL stored procedure or script and schedule it to run at regular intervals (a procedure-based variant is sketched after the task example below). Tasks can be time-based, using CRON expressions, or event-based.
CREATE OR REPLACE TASK infer_and_ingest_task
    WAREHOUSE = my_wh
    SCHEDULE = 'USING CRON 0 * * * * UTC' -- every hour
AS
BEGIN
    CREATE OR REPLACE TABLE mytable
    USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(
            INFER_SCHEMA(
                LOCATION=>'@mystage/csv/',
                FILE_FORMAT=>'my_csv_format'
            )
        )
    );

    COPY INTO mytable
    FROM '@mystage/csv/'
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    PURGE = TRUE;
END;
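As mentioned above, the same logic can also be wrapped in a SQL stored procedure that the task simply calls; the sketch below assumes a hypothetical procedure name, ingest_latest_csv, and reuses the stage, file format, and table from the examples above. Note that a newly created task is suspended until it is explicitly resumed.

-- Hypothetical procedure wrapping the infer-and-ingest logic
CREATE OR REPLACE PROCEDURE ingest_latest_csv()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    -- Recreate the table from the latest inferred schema
    CREATE OR REPLACE TABLE mytable
    USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(
            INFER_SCHEMA(
                LOCATION=>'@mystage/csv/',
                FILE_FORMAT=>'my_csv_format'
            )
        )
    );

    -- Load the staged files and clean them up afterwards
    COPY INTO mytable
    FROM '@mystage/csv/'
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    PURGE = TRUE;

    RETURN 'ingestion completed';
END;
$$;

-- The task body then reduces to a single call
CREATE OR REPLACE TASK infer_and_ingest_task
    WAREHOUSE = my_wh
    SCHEDULE = 'USING CRON 0 * * * * UTC'
AS
    CALL ingest_latest_csv();

-- Tasks are created in a suspended state and must be resumed to start running
ALTER TASK infer_and_ingest_task RESUME;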
This makes schema-flexible ingestion pipelines fully automatable with little maintenance overhead. For large or enterprise workflows, you can integrate these tasks into orchestration platforms like Apache Airflow or dbt Cloud.
Pro tip
Instead of using the inferred data types, hardcode the column definitions using OBJECT_CONSTRUCT and manually assign the VARCHAR data type to all columns. See the example below.
CREATE TABLE mytable
USING TEMPLATE (
    SELECT ARRAY_AGG(
        OBJECT_CONSTRUCT(
            'COLUMN_NAME', COLUMN_NAME,
            'TYPE', 'VARCHAR',
            'NULLABLE', NULLABLE
        )
    )
    FROM TABLE(
        INFER_SCHEMA(
            LOCATION=>'@mystage/csv/',
            FILE_FORMAT=>'my_csv_format'
        )
    ));
*INFER_SCHEMA metadata: the function returns one row per detected column, with the metadata columns COLUMN_NAME, TYPE, NULLABLE, EXPRESSION, FILENAMES, and ORDER_ID.
Limitations
All the variations of timestamp data types (DATE_FORMAT, TIME_FORMAT, and TIMESTAMP_FORMAT) are retrieved as TIMESTAMP_NTZ without any time zone information.
For CSV files, SKIP_HEADER does not work in schema inference mode. Use PARSE_HEADER = TRUE in your file format instead.
Using * for ARRAY_AGG(OBJECT_CONSTRUCT()) can cause errors if the returned result is larger than 16 MB. Only use the required columns COLUMN_NAME, TYPE, and NULLABLE in the query (see the sketch after this list).
For files with nested data, only the first level of nesting is supported.
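To stay under that 16 MB limit while still keeping the inferred data types (unlike the pro tip above, which forces everything to VARCHAR), the template query can list only the required keys explicitly; a sketch reusing the same stage and file format:

CREATE TABLE mytable
USING TEMPLATE (
    SELECT ARRAY_AGG(
        OBJECT_CONSTRUCT(
            'COLUMN_NAME', COLUMN_NAME,
            'TYPE', TYPE,          -- keep the inferred data type
            'NULLABLE', NULLABLE
        )
    )
    FROM TABLE(
        INFER_SCHEMA(
            LOCATION=>'@mystage/csv/',
            FILE_FORMAT=>'my_csv_format'
        )
    ));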
Conclusion
If your source data structure is inconsistent or evolves frequently, don’t fight it. Adapt your pipeline using INFER_SCHEMA and Snowflake Tasks. This approach removes the friction from handling schema drift, eliminates the need for frequent manual DDL updates, and allows your ingestion pipelines to stay responsive to upstream changes. With automation via Snowflake Tasks, dynamic table creation, and flexible ingestion strategies, your data platform can evolve as fast as your source data does.
References
File formats: https://docs.snowflake.com/en/sql-reference/sql/create-file-format
Infer schema: https://docs.snowflake.com/en/sql-reference/functions/infer_schema
Copy into: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table