---
title: "Troubleshoot: Reading UTF-8 text from CSV or Parquet files using serverless SQL pool"
description: Reading UTF-8 text from CSV or Parquet files using serverless SQL pool in Azure Synapse Analytics
author: Danzhang-msft
ms.author: danzhang
ms.topic: troubleshooting
ms.service: azure-synapse-analytics
ms.subservice: troubleshooting
ms.date: 02/27/2025
---

# Troubleshoot reading UTF-8 text from CSV or Parquet files using serverless SQL pool in Azure Synapse Analytics

This article provides troubleshooting steps for reading UTF-8 text from CSV or Parquet files using serverless SQL pool in Azure Synapse Analytics.

When UTF-8 text is read from a CSV or Parquet file using serverless SQL pool, special characters such as ü and ö are incorrectly converted if the query returns varchar columns with non-UTF-8 collations. This is a known issue in SQL Server and Azure SQL. Because a non-UTF-8 collation is the default in Synapse SQL, customer queries are affected. Customers who use only standard English characters and a subset of extended Latin characters might not notice the conversion errors. The incorrect conversion is explained in more detail in Always use UTF-8 collations to read UTF-8 text in serverless SQL pool.
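The mechanism behind the garbling can be reproduced outside SQL: text stored as UTF-8 bytes but decoded with a single-byte Latin code page turns each multi-byte character into two wrong characters. A minimal Python sketch of the same encoding mismatch (this is an illustration of the principle, not Synapse itself):

```python
# UTF-8 text as it would be stored in a CSV or Parquet file.
text = "München"
utf8_bytes = text.encode("utf-8")  # 'ü' becomes the two bytes 0xC3 0xBC

# A non-UTF-8 varchar collation treats each byte as one Latin-1
# character, so every multi-byte sequence is mangled.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # prints "MÃ¼nchen"

# Decoding the same bytes as UTF-8 (a UTF-8 collation) round-trips cleanly.
print(utf8_bytes.decode("utf-8"))  # prints "München"
```

The two-character `Ã¼` artifact in place of `ü` is the typical symptom described in this article.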

## Workaround

The workaround for this issue is to always use a UTF-8 collation when reading UTF-8 text from CSV or Parquet files.
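Before applying a fix, you can check which collation the database currently uses and which UTF-8 collations are available. A sketch, assuming a database named `MyDB` as in the example below:

```sql
-- Current default collation of the database (MyDB is a placeholder name).
SELECT DATABASEPROPERTYEX('MyDB', 'Collation') AS current_collation;

-- Available UTF-8 collations you could switch to.
SELECT name FROM sys.fn_helpcollations() WHERE name LIKE '%UTF8';
```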

- In many cases, you only need to set a UTF-8 collation on the database (a metadata operation):

  ```sql
  ALTER DATABASE MyDB
      COLLATE Latin1_General_100_BIN2_UTF8;
  ```
- You can explicitly define the collation on a varchar column in OPENROWSET or in an external table:

  ```sql
  SELECT geo_id, cases = SUM(cases)
  FROM OPENROWSET(
          BULK 'latest/ecdc_cases.parquet', DATA_SOURCE = 'covid', FORMAT = 'parquet'
      ) WITH ( cases INT,
               geo_id VARCHAR(6) COLLATE Latin1_General_100_BIN2_UTF8 ) AS rows
  GROUP BY geo_id;
  ```
- If you didn't specify a UTF-8 collation on external tables that read UTF-8 data, you need to re-create the impacted external tables and set a UTF-8 collation on their varchar columns (a metadata operation).
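For the external-table case, the re-creation could look like the following sketch. The table, data source, and file format names (`my_cases`, `covid`, `ParquetFormat`) are hypothetical placeholders, not objects defined in this article:

```sql
-- Drop and re-create an impacted external table, adding a UTF-8
-- collation to its varchar columns. Only metadata changes; the
-- underlying files are not rewritten.
DROP EXTERNAL TABLE my_cases;

CREATE EXTERNAL TABLE my_cases (
    cases  INT,
    geo_id VARCHAR(6) COLLATE Latin1_General_100_BIN2_UTF8
)
WITH (
    LOCATION = 'latest/ecdc_cases.parquet',
    DATA_SOURCE = covid,
    FILE_FORMAT = ParquetFormat
);
```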

## Related content