If you’re working with an Azure Data Factory (ADF) pipeline to load a parquet file into a database table and you notice the file size creep up slightly from one day to the next, that’s usually just a sign of added data or minor updates. For example, the file was 16,384 KB on Friday and is 16,387 KB now, which is consistent with a few new entries.
In your case, the file is an exact copy of data from another system, with changes mainly involving site names, enabling/disabling sites, or updating site attributes. You mentioned that six new sites were added, with no other modifications.
Recently, you encountered an error during the load process that looks like this:
ErrorCode=DelimitedTextMoreColumnsThanDefined,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error found when processing 'Csv/Tsv Format Text' source (unknown) with row number 24702: found more columns than expected column count 30.,Source=Microsoft.DataTransfer.Common,'
This error means a row parsed into more columns than the dataset defines. Interestingly, after opening the file with Excel, you confirmed there are no extra columns. Keep in mind, though, that Excel applies its own CSV parsing and can silently mask quoting or delimiter problems, so the issue is more likely hiding in the raw text of specific data rows than in the schema itself.
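One way to rule Excel out is to look at the raw text around the failing record. A small sketch (the filename here is hypothetical; note that ADF’s reported row number counts parsed records, which can drift from physical line numbers if any field contains an embedded newline, so inspect a window around 24,702 rather than that exact line):

```python
def raw_lines(path, start, end):
    """Return (line_number, text) for physical lines start..end (1-based),
    exactly as they appear on disk. Viewing the raw text avoids Excel
    silently 'repairing' quoting or delimiter problems."""
    out = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if n > end:
                break
            if n >= start:
                out.append((n, line.rstrip("\n")))
    return out

# e.g. raw_lines("sites_export.csv", 24700, 24704)
```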
You also noted that the files that failed contain four more data rows than the last successful load. Sometimes, errors like this happen because of subtle data issues, such as:
- Extra delimiters like commas or tabs within the data fields
- Missing closing quotes around text fields
- Inconsistencies in data formatting
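A quick way to surface these subtle issues is to count parsed fields per record, since `csv.reader` honours quoting the same way a delimited-text parser would. A minimal sketch, assuming a comma delimiter and the 30-column count from the error message:

```python
import csv

EXPECTED_COLUMNS = 30  # the column count defined in the ADF dataset

def find_bad_rows(path, expected=EXPECTED_COLUMNS, delimiter=","):
    """Yield (record_number, field_count) for records whose parsed field
    count differs from what the pipeline expects. A properly quoted
    embedded comma is NOT flagged, while a stray delimiter or an
    unbalanced quote shows up as a field-count mismatch."""
    with open(path, newline="", encoding="utf-8") as f:
        for n, row in enumerate(csv.reader(f, delimiter=delimiter), start=1):
            if len(row) != expected:
                yield n, len(row)
```

Running this over the failing file should point straight at the offending records without scrolling through 24,000+ rows by hand.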
To troubleshoot and fix this problem:
- Check the Data Rows: Look closely at the problematic rows (around row 24,702) in the failing file. See if any fields contain extra commas, tabs, or quotes that could cause the parser to misinterpret the columns.
- Validate the Formatting: Ensure all text fields are properly enclosed in quotes, especially if they contain delimiters. Any missing or extra quotes can throw off the column count.
- Use a Text Editor or Data Validation Tools: Open the full file in a text editor designed for large files, like Notepad++ or VS Code, to examine the specific rows more easily.
- Compare Files: Since the only recent changes involve added sites, compare the last successful file with the current, problematic one. Look for differences in data entries around the problematic rows.
- Test with a Subset: Create a smaller version of the file that includes just the failed rows and see if the error persists. This can help isolate the issue.
- Review Data Generation: If the data files are generated automatically, double-check the generation process to ensure it handles special characters correctly.
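The compare-and-subset steps above can be sketched together. Since the only known change is a handful of added sites, pulling out the lines that exist only in the failing file gives you exactly the rows to inspect and to use as a test subset (file paths here are hypothetical, and the sketch assumes each record fits on one physical line):

```python
def new_rows(last_good_path, failing_path):
    """Return (line_number, raw_line) for lines in the failing file that
    do not appear anywhere in the last successful file. With only a few
    sites added, these are the rows to inspect first and the natural
    candidates for a small test file (header + these lines)."""
    with open(last_good_path, encoding="utf-8") as f:
        known = set(f)
    with open(failing_path, encoding="utf-8") as f:
        return [(n, line) for n, line in enumerate(f, start=1)
                if line not in known]
```

Writing the header plus these returned lines to a new file and pointing the pipeline at it is a cheap way to confirm whether the added site rows alone reproduce the error.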
Following these steps should help you pinpoint the cause of the “more columns than expected” error. Once identified, cleaning up the data so that it conforms to the expected format will allow the pipeline to load without issues again.