USE 531 - Add "fulltexts" data type #189

Merged: ghukill merged 2 commits into main from USE-531-fulltext-data-type-add on May 20, 2026
@ghukill ghukill commented May 18, 2026

Purpose and background context

This PR adds a new "fulltexts" data type to the TIMDEX dataset, sitting alongside the pre-existing data types "records" and "embeddings".

This addition leverages the new data type parity: it requires basically just a schema + metadata definition and relies on shared code for all data types.

These "fulltexts" rows capture fulltext associated with a TIMDEX record. At this time, only DSpace theses will have fulltext records, but we may explore including LibGuides and MITLib websites fulltext here as well (currently they are added directly to the OpenSearch document by Transmogrifier, which makes them searchable but not available in the dataset for any additional uses). We may also explore fulltext for other sources, e.g. archival materials, Dome materials, etc.

Unlike embeddings, there is not much metadata about fulltexts at this time. We do include an MD5 checksum of the fulltext bytes to avoid re-harvesting / re-writing fulltext in the future, but that is not currently utilized (though the CLI app dspace-fulltext-harvester may use it). The primary data column is simply fulltext, which holds the fulltext of the record as bytes.
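
As a rough illustration of how the checksum could be used to skip unchanged fulltext, here is a minimal sketch using only the Python standard library. The helper names `fulltext_md5` and `needs_rewrite` are hypothetical and are not part of timdex_dataset_api; the only thing taken from this PR is that the checksum is an MD5 digest of the fulltext bytes, shown as a hex string in the sample record.

```python
import hashlib
from typing import Optional


def fulltext_md5(fulltext: bytes) -> str:
    """Return the MD5 hex digest of the raw fulltext bytes (hypothetical helper)."""
    return hashlib.md5(fulltext).hexdigest()


def needs_rewrite(fulltext: bytes, stored_md5: Optional[str]) -> bool:
    """Return True when there is no stored checksum or the bytes have changed."""
    return stored_md5 is None or fulltext_md5(fulltext) != stored_md5


# a previously stored checksum, and two candidate harvests
stored = fulltext_md5(b"hello")
unchanged = needs_rewrite(b"hello", stored)   # same bytes, nothing to write
changed = needs_rewrite(b"hello!", stored)    # bytes differ, write again
```

A harvester could compare against the stored `fulltext_md5` before downloading or writing anything, which is the kind of future use the paragraph above hints at.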

The first planned use of this new data type is this:

  1. For source dspace, run the dspace-fulltext-harvester as part of the TIMDEX ETL StepFunction
  2. Write data back to this new data type
  3. Update TIM to update OpenSearch records with the fulltext stored here, much as we did for embeddings

How can a reviewer manually see the effects of these changes?

1- Set Dev TimdexManagers Credentials

2- Set env vars:

TDA_LOG_LEVEL=DEBUG
WARNING_ONLY_LOGGERS=asyncio,botocore,urllib3,s3transfer,boto3,MARKDOWN
TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset

3- Load dataset:

import os

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])

4- Write batch of fulltexts records, using an arbitrary researchdatabases ETL run as the source TIMDEX records:

from tests.utils import generate_sample_fulltexts_for_run

# utility to generate some fulltexts
fulltexts = generate_sample_fulltexts_for_run(
    timdex_dataset=td,
    run_id="18cbec87-ce25-4355-a526-caaa3eb1ac1e",  # run for 'researchdatabases'
)

# write to dataset, using recently normalized 'td.<data_type>.write(...)'
td.fulltexts.write(fulltexts)
# INFO:timdex_dataset_api.data_type:Dataset write complete - elapsed: 12.0s, total files: 1, total rows: 901, total size: 101180

5- Confirm our write worked and we can see records:

# get all fulltexts for 'researchdatabases' source
td.fulltexts.read_dataframe(table='current_fulltexts')

# get a single 'fulltexts' record (where the 'xxxxx...' is just filler text)
next(td.fulltexts.read_dicts_iter(table='current_fulltexts'))
"""
Out[12]: 
{'timdex_record_id': 'researchdatabases:az-65257807',
 'source': 'researchdatabases',
 'run_date': datetime.datetime(2026, 5, 1, 0, 0),
 'run_type': 'full',
 'action': 'index',
 'run_id': '18cbec87-ce25-4355-a526-caaa3eb1ac1e',
 'run_record_offset': 0,
 'run_timestamp': datetime.datetime(2026, 5, 1, 15, 24, 31, 248765, tzinfo=<DstTzInfo 'America/Detroit' EDT-1 day, 20:00:00 DST>),
 'filename': 's3://timdex-extract-dev-222053980223/dataset/data/fulltexts/year=2026/month=05/day=01/ab02e407-b62d-4807-8f3b-7dd1c3a4d6ad-0.parquet',
 'fulltext_timestamp': datetime.datetime(2026, 5, 1, 15, 24, 31, 248765, tzinfo=<DstTzInfo 'America/Detroit' EDT-1 day, 20:00:00 DST>),
 'fulltext_md5': '48dac3804fbe73d3aeb1e272e0460045',
 'fulltext': b'Sample fulltext content for researchdatabases:az-65257807.  Content xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'}
"""

Note that the record returned here, like the rows we'll see in step 6 below, contains additional metadata about the TIMDEX record, e.g. [source, run_date, action, ...]. This is a byproduct of the v5 refactoring, which gives every data type in the dataset very similar "base" metadata for each row, specifically about the record it's associated with.

6- Confirm the fulltext_md5 is a queryable / filterable metadata field, not a data field:

td.conn.query("""select * from metadata.fulltexts where fulltext_md5 = '967d0a0bf60829bee5dfc05476cc0f0f';""")
# NOTE: there may be multiple rows for this MD5 checksum, given multiple writes
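
If the duplicate rows from repeated writes get in the way, they can be collapsed after the query. The following is only a sketch over plain dicts: `latest_row_per_checksum` is a hypothetical helper, not part of timdex_dataset_api, and it assumes each row carries the `timdex_record_id`, `fulltext_md5`, and `run_timestamp` columns seen in the sample record from step 5.

```python
from datetime import datetime


def latest_row_per_checksum(rows):
    """Keep only the most recently written row per (record, checksum) pair."""
    latest = {}
    for row in rows:
        key = (row["timdex_record_id"], row["fulltext_md5"])
        current = latest.get(key)
        if current is None or row["run_timestamp"] > current["run_timestamp"]:
            latest[key] = row
    return list(latest.values())


# illustrative rows only; the values are made up for this sketch
rows = [
    {
        "timdex_record_id": "researchdatabases:az-1",
        "fulltext_md5": "abc",
        "run_timestamp": datetime(2026, 5, 1, 15, 0),
    },
    {
        "timdex_record_id": "researchdatabases:az-1",
        "fulltext_md5": "abc",
        "run_timestamp": datetime(2026, 5, 2, 9, 0),
    },
    {
        "timdex_record_id": "researchdatabases:az-2",
        "fulltext_md5": "def",
        "run_timestamp": datetime(2026, 5, 1, 15, 0),
    },
]
deduped = latest_row_per_checksum(rows)
```

The `current_fulltexts` table used in step 5 may already give a de-duplicated view; this sketch is only meant for rows pulled from `metadata.fulltexts` directly.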

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

The TIMDEX dataset originally held "records".  Then we bolted on "embeddings" as
another data type in the dataset, but with its own code paths.  Then we refactored
so "records" and "embeddings" had the same signatures and shared code.

Now, we're ready for a new data type "fulltexts": full text, not just metadata, that
is associated with a record.  This data type will have very little metadata beyond
associating with a specific TIMDEX record version.

How this addresses that need:

Adds new module `data_types/fulltexts.py` with the main class `TIMDEXFulltexts`.  This
class follows the plural naming conventions of "records" and "embeddings".

The primary data column for this data type is "fulltext", the actual text payload.  A
metadata column `fulltext_md5` is also added to fingerprint the text and provide
a way to avoid storing the fulltext again if not needed.

Side effects of this change:
* Arguably, none really.  The TIMDEX dataset gets a new data type, but it's
inconsequential until used.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-531
@ghukill ghukill changed the base branch from main to USE-531-fulltext-data-type May 18, 2026 18:53
@ghukill ghukill marked this pull request as ready for review May 18, 2026 19:12
@ghukill ghukill requested a review from a team as a code owner May 18, 2026 19:12
@ghukill ghukill changed the base branch from USE-531-fulltext-data-type to main May 18, 2026 20:59
Additionally, move data type related imports back to root namespace for
backwards compatibility.
@ghukill ghukill force-pushed the USE-531-fulltext-data-type-add branch from 7dd5850 to 7fe726f Compare May 19, 2026 18:12
@ehanson8 ehanson8 self-assigned this May 20, 2026
@ehanson8 ehanson8 left a comment

Works as expected and code looks good!

@ghukill ghukill merged commit 450af3c into main May 20, 2026
2 checks passed

2 participants