Skip to content

Parse read_nsrdb_psm4 header with the csv module to keep quoted commas (fixes #2736)#2771

Open
gaoflow wants to merge 1 commit into
pvlib:mainfrom
gaoflow:fix-2736-nsrdb-psm4-quoted-columns
Open

Parse read_nsrdb_psm4 header with the csv module to keep quoted commas (fixes #2736)#2771
gaoflow wants to merge 1 commit into
pvlib:mainfrom
gaoflow:fix-2736-nsrdb-psm4-quoted-columns

Conversation

@gaoflow
Copy link
Copy Markdown

@gaoflow gaoflow commented Jun 2, 2026

What this fixes

read_nsrdb_psm4 parsed its three header lines with a naive str.split(','):

metadata_fields = fbuf.readline().split(',')
metadata_values = fbuf.readline().split(',')
columns        = fbuf.readline().split(',')

The NSRDB spectral-on-demand CSVs reported in #2736 have quoted column
names that contain commas, e.g.

..., "GaAs (Bauhuis et al., 2009)","InGaP (Gray, 2008)", ...

These are valid CSV (the commas are inside quotes), and pandas.read_csv
parses the data rows correctly — but str.split(',') splits each quoted name
into multiple fragments, inflating the column count. The mismatch between the
mis-split names/usecols and the correctly-parsed data then raises on read.

The change

Parse the three header lines with the csv module (which honors quoting)
instead of str.split(','). For ordinary (unquoted) files this is identical
to the previous behavior, so the existing readers are unaffected.

This addresses the parsing crash that @kandersolar confirmed should be
supported. The further map_variables=True unit handling for spectral files
(W/m²/µm → W/m²/nm) mentioned in the issue is a separate enhancement and is
left out of scope here.

Reproduction (before this PR)

from io import StringIO
from pvlib.iotools import psm4

content = (
    "Source,Location ID,City,State,Country,Latitude,Longitude,Time Zone,"
    "Elevation,Local Time Zone,Version\n"
    "NSRDB,1,-,-,-,40.0,-105.0,-7,1600,-7,4.0.1\n"
    'Year,Month,Day,Hour,Minute,GHI,"GaAs (Bauhuis et al., 2009)",'
    '"InGaP (Gray, 2008)"\n'
    "2023,1,1,0,0,0,0.1,0.2\n"
    "2023,1,1,1,0,5,0.3,0.4\n"
)
psm4.read_nsrdb_psm4(StringIO(content), map_variables=False)
# ParserError: Too many columns specified: expected 10 and found 8

After the fix the quoted columns survive intact
('GaAs (Bauhuis et al., 2009)', 'InGaP (Gray, 2008)').

A regression test (test_read_nsrdb_psm4_quoted_columns_with_commas) is added
that fails on main and passes with this change; the existing
read_nsrdb_psm4 tests continue to pass.

read_nsrdb_psm4 split the three header lines with a naive str.split(','),
which broke spectral-on-demand files whose column names are quoted fields
containing commas (e.g. '"GaAs (Bauhuis et al., 2009)"'). Such names were
split into spurious columns, raising on read. Parse the header lines with
the csv module so quoted fields are kept intact.

Fixes pvlib#2736
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Issue with importing NSRDB spectral-on-demand files with pvlib.iotools.read_nsrdb_psm4

1 participant