Skip to content

Validation_goldens modfication#2043

Open
niveditasing wants to merge 6 commits into
datacommonsorg:masterfrom
niveditasing:validation_golden_fix
Open

Validation_goldens modfication#2043
niveditasing wants to merge 6 commits into
datacommonsorg:masterfrom
niveditasing:validation_golden_fix

Conversation

@niveditasing

@niveditasing niveditasing commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary
This PR fixes issues in our validation pipeline, including data filtering mismatches, malformed CSV output formatting, file path resolution errors, and CSV/MCF parsing crashes.

  1. runner.py — Dynamic Path Resolution
  • Problem: Running validations from the repository root (e.g. during automated CI runs) broke relative paths like "golden_data/un_wpp.csv", resulting in reading 0 Data

  • Fix:

    • Introduced helper functions _is_relative_local and _find_base_dir to traverse the directory tree upward (up to 10 levels) to locate the golden_data/ directory relative to the config file or working directory.
    • Dynamically resolves golden_files and input_files to absolute paths before dispatching GOLDENS_CHECK.
    • Added corresponding unit tests in runner_test.py.
  1. validator_goldens.py — Safe CSV Serialization and Namespace Stripping
  • Problem:

    1. The existing utility wrote CSV data without wrapping text in quotes, causing layout corruption on fields containing internal commas or quotes.
    2. Input files containing raw strings prefixed with dcid: (e.g., dcid:Earth) failed to generate goldens because of exact match requirements against normalized schema property lists.
  • Fix:

    • Replaced the generic dictionary CSV writer with csv.DictWriter explicitly configured with quoting=csv.QUOTE_NONNUMERIC, quotechar='"', and escapechar='\'.
    • Updated load_nodes_from_file to automatically sanitize and strip any leading dcid: namespaces from loaded CSV string cell values during parsing, keeping headers intact.
    • Added test coverage in validator_goldens_test.py to assert data loading with dcid: prefixed values.

Verification and Testing Performed:

tested two imports to determine if the code is working. Below are the validation_output results, one showing a failure and the other showing success.

  1. WorldDevelopmentIndicators : https://storage.mtls.cloud.google.com/datcom-import-test/scripts/world_bank/wdi/WorldDevelopmentIndicators/2026_06_07T12_58_40_686171_07_00/input0/validation/validation_output.csv
  2. EurostatData_Fertility: https://storage.mtls.cloud.google.com/datcom-import-test/scripts/eurostat/regional_statistics_by_nuts/fertility_rate_mother_age/EurostatData_Fertility/2026_06_07T23_36_15_942332_07_00/input0/validation/validation_output.csv

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates validator_goldens.py to support namespace-stripped matching when generating goldens and replaces the helper function for writing CSVs with custom csv.DictWriter logic. A review comment points out that extracting CSV column headers from only the first node in golden_nodes assumes all nodes share the same keys, which can lead to errors or missing columns if keys vary. It suggests collecting the union of all keys across all nodes instead.

Comment thread tools/import_validation/validator_goldens.py
@balit-raibot

Copy link
Copy Markdown
Contributor

@gemini-code-assist review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates tools/import_validation/validator_goldens.py to import the csv module, strip namespaces when matching nodes against must_include_values, and replace the helper file_write_csv_dict with a custom csv.DictWriter implementation. The feedback suggests adding newline='' when opening the CSV file to prevent platform-specific carriage return issues and simplifying the logic for extracting and sorting unique keys.

Comment thread tools/import_validation/validator_goldens.py Outdated
@niveditasing

Copy link
Copy Markdown
Contributor Author

@gemini-code-assist review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces path resolution logic for summary report rules in the validation runner, cleans up CSV node keys during loading, and updates golden file generation to use Python's standard csv.DictWriter. Feedback focuses on adhering to PEP 8 line length limits, replacing debug print statements with proper logging.debug calls, and opening CSV files with newline='' to ensure cross-platform compatibility on Windows.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread tools/import_validation/runner.py Outdated
Comment thread tools/import_validation/validator_goldens.py Outdated
Comment thread tools/import_validation/validator_goldens.py
@niveditasing

Copy link
Copy Markdown
Contributor Author

@gemini-code-assist review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces relative path resolution for the GOLDENS_CHECK validator and improves CSV parsing robustness when handling values containing colons (such as DCIDs). Feedback on these changes suggests using repository marker files (like .git or WORKSPACE) for more robust root path resolution, replacing debug print statements with proper logging.debug calls, and streaming CSV files directly to csv.DictReader to avoid reading entire files into memory.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread tools/import_validation/runner.py Outdated
Comment thread tools/import_validation/runner.py Outdated
Comment thread tools/import_validation/validator_goldens.py Outdated
@niveditasing niveditasing force-pushed the validation_golden_fix branch from c3a1f36 to ef96648 Compare June 8, 2026 10:48
@niveditasing niveditasing force-pushed the validation_golden_fix branch from ef96648 to 81a632b Compare June 8, 2026 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants