USE 531 - Add "fulltexts" data type by ghukill · Pull Request #189 · MITLibraries/timdex-dataset-api

ghukill · 2026-05-18T18:53:31Z

Purpose and background context

This PR adds a new "fulltexts" data type to the TIMDEX dataset, sitting alongside the pre-existing data types "records" and "embeddings".

Adding this data type leverages the new data type parity: it requires little more than a schema and metadata definition, and relies on the code shared by all data types.

These "fulltexts" rows capture the fulltext associated with a TIMDEX record. At this time, only DSpace theses will have fulltext records, but we may explore including LibGuides and MITLib websites fulltext here as well (currently they are added directly to the OpenSearch document by Transmogrifier, making them searchable, but not available in the dataset for any additional uses). We may also explore fulltext for other sources, e.g. archival materials, Dome materials, etc.

Unlike embeddings, there is not much metadata about fulltexts at this time. We do include an MD5 checksum of the fulltext bytes to avoid re-harvesting / re-writing fulltext in the future, but that is not currently utilized (though the CLI app dspace-fulltext-harvester may use it). The primary data column is simply fulltext, which holds the fulltext of the record as bytes.
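As a minimal sketch of how the checksum could be used to skip redundant writes: the helper names below (`fulltext_md5`, `needs_write`) and the in-memory `known` mapping are hypothetical, not part of timdex-dataset-api, and the sketch assumes the stored value is the hex MD5 digest of the raw bytes.

```python
import hashlib


def fulltext_md5(fulltext: bytes) -> str:
    """Return the hex MD5 digest of the raw fulltext bytes."""
    return hashlib.md5(fulltext).hexdigest()


def needs_write(record_id: str, fulltext: bytes, known_checksums: dict[str, str]) -> bool:
    """Return True when the fulltext is new or has changed for this record."""
    return known_checksums.get(record_id) != fulltext_md5(fulltext)


# Hypothetical checksums already stored in the dataset, keyed by timdex_record_id.
known = {"dspace:abc-123": fulltext_md5(b"original thesis text")}

print(needs_write("dspace:abc-123", b"original thesis text", known))  # False: unchanged
print(needs_write("dspace:abc-123", b"revised thesis text", known))   # True: content changed
print(needs_write("dspace:def-456", b"brand new text", known))        # True: record not seen before
```

In practice the known checksums would come from the dataset's metadata rather than a hand-built dict.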

The first planned use of this new data type is this:

1- For source dspace, run the dspace-fulltext-harvester as part of the TIMDEX ETL StepFunction
2- Write data back to this new data type
3- Update TIM to update OpenSearch records with fulltext stored here, very similar to how we did embeddings

How can a reviewer manually see the effects of these changes?

1- Set Dev TimdexManagers Credentials

2- Set env vars:

TDA_LOG_LEVEL=DEBUG
WARNING_ONLY_LOGGERS=asyncio,botocore,urllib3,s3transfer,boto3,MARKDOWN
TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset

3- Load dataset:

import os

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])

4- Write batch of fulltexts records, using an arbitrary researchdatabases ETL run as the source TIMDEX records:

from tests.utils import generate_sample_fulltexts_for_run

# utility to generate some fulltexts
fulltexts = generate_sample_fulltexts_for_run(
    timdex_dataset=td,
    run_id="18cbec87-ce25-4355-a526-caaa3eb1ac1e",  # run for 'researchdatabases'
)

# write to dataset, using recently normalized 'td.<data_type>.write(...)'
td.fulltexts.write(fulltexts)
# INFO:timdex_dataset_api.data_type:Dataset write complete - elapsed: 12.0s, total files: 1, total rows: 901, total size: 101180

5- Confirm our write worked and we can see records:

# get all fulltexts for 'researchdatabases' source
td.fulltexts.read_dataframe(table='current_fulltexts')

# get a single 'fulltexts' record (where the 'xxxxx...' is just filler text)
next(td.fulltexts.read_dicts_iter(table='current_fulltexts'))
"""
Out[12]: 
{'timdex_record_id': 'researchdatabases:az-65257807',
 'source': 'researchdatabases',
 'run_date': datetime.datetime(2026, 5, 1, 0, 0),
 'run_type': 'full',
 'action': 'index',
 'run_id': '18cbec87-ce25-4355-a526-caaa3eb1ac1e',
 'run_record_offset': 0,
 'run_timestamp': datetime.datetime(2026, 5, 1, 15, 24, 31, 248765, tzinfo=<DstTzInfo 'America/Detroit' EDT-1 day, 20:00:00 DST>),
 'filename': 's3://timdex-extract-dev-222053980223/dataset/data/fulltexts/year=2026/month=05/day=01/ab02e407-b62d-4807-8f3b-7dd1c3a4d6ad-0.parquet',
 'fulltext_timestamp': datetime.datetime(2026, 5, 1, 15, 24, 31, 248765, tzinfo=<DstTzInfo 'America/Detroit' EDT-1 day, 20:00:00 DST>),
 'fulltext_md5': '48dac3804fbe73d3aeb1e272e0460045',
 'fulltext': b'Sample fulltext content for researchdatabases:az-65257807.  Content xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'}
"""

Note that the record returned here, like the rows in step 6 below, contains additional metadata about the TIMDEX record, such as source, run_date, and action. This is a byproduct of the v5 refactoring, which gives every data type in the dataset very similar "base" metadata for each row, specifically about the record it is associated with.
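To make the split between shared "base" metadata and fulltext-specific columns concrete, here is a small sketch. The `split_row` helper is hypothetical, and the rule it uses (columns starting with `fulltext` are specific to this data type) is an assumption read off the sample output above, not something the library defines.

```python
from datetime import datetime


def split_row(row: dict, prefix: str = "fulltext") -> tuple[dict, dict]:
    """Split a row into (base metadata, data-type-specific columns) by column-name prefix."""
    base = {k: v for k, v in row.items() if not k.startswith(prefix)}
    specific = {k: v for k, v in row.items() if k.startswith(prefix)}
    return base, specific


# Abridged copy of the sample row from step 5; the filler text is dropped,
# so the checksum shown no longer corresponds to these exact bytes.
row = {
    "timdex_record_id": "researchdatabases:az-65257807",
    "source": "researchdatabases",
    "run_date": datetime(2026, 5, 1),
    "run_type": "full",
    "action": "index",
    "run_id": "18cbec87-ce25-4355-a526-caaa3eb1ac1e",
    "fulltext_md5": "48dac3804fbe73d3aeb1e272e0460045",
    "fulltext": b"Sample fulltext content for researchdatabases:az-65257807.",
}

base, specific = split_row(row)
print(sorted(specific))  # ['fulltext', 'fulltext_md5']
print(base["source"])    # researchdatabases
```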

6- Confirm the fulltext_md5 is a queryable / filterable metadata field, not a data field:

td.conn.query("""select * from metadata.fulltexts where fulltext_md5 = '967d0a0bf60829bee5dfc05476cc0f0f';""")
# NOTE: there may be multiple rows for this MD5 checksum, given multiple writes
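Because repeated writes can leave several rows with the same checksum, a caller may want to collapse them. The sketch below is one way to do that in plain Python, keeping the newest row per record and checksum. The helper `latest_per_record_checksum` and the sample rows are hypothetical; only the column names come from the output in step 5.

```python
from datetime import datetime


def latest_per_record_checksum(rows: list[dict]) -> list[dict]:
    """Keep the newest row for each (timdex_record_id, fulltext_md5) pair."""
    latest: dict[tuple[str, str], dict] = {}
    for row in rows:
        key = (row["timdex_record_id"], row["fulltext_md5"])
        current = latest.get(key)
        if current is None or row["fulltext_timestamp"] > current["fulltext_timestamp"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical rows, as if the same fulltext had been written on two different days.
rows = [
    {"timdex_record_id": "researchdatabases:az-1", "fulltext_md5": "aaa",
     "fulltext_timestamp": datetime(2026, 5, 1, 15, 0)},
    {"timdex_record_id": "researchdatabases:az-1", "fulltext_md5": "aaa",
     "fulltext_timestamp": datetime(2026, 5, 2, 9, 0)},
    {"timdex_record_id": "researchdatabases:az-2", "fulltext_md5": "bbb",
     "fulltext_timestamp": datetime(2026, 5, 1, 15, 0)},
]

deduped = latest_per_record_checksum(rows)
print(len(deduped))  # 2
```

Keying on the record ID as well as the checksum avoids merging two different records that happen to share identical text.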

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/USE-531

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

The TIMDEX dataset originally held "records". Then we bolted on "embeddings" as another data type in the dataset, but with its own code paths. Then we refactored so "records" and "embeddings" had the same signatures and shared code. Now, we're ready for a new data type "fulltexts": full text, not just metadata, that is associated with a record. This data type will have very little metadata beyond associating with a specific TIMDEX record version.

How this addresses that need:

Adds new module `data_types/fulltexts.py` with the main class `TIMDEXFulltexts`. This class follows the plural naming conventions of "records" and "embeddings". The primary data column for this data type is "fulltext", the actual text payload. A metadata column `fulltext_md5` is also added to fingerprint the text and provide a way to avoid storing the fulltext again if not needed.

Side effects of this change:
* Arguably, none really. The TIMDEX dataset gets a new data type, but it's inconsequential until used.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-531

Additionally, move data type related imports back to root namespace for backwards compatibility.

ehanson8

Works as expected and code looks good!

ghukill changed the base branch from main to USE-531-fulltext-data-type May 18, 2026 18:53

ghukill marked this pull request as ready for review May 18, 2026 19:12

ghukill requested a review from a team as a code owner May 18, 2026 19:12

ghukill changed the base branch from USE-531-fulltext-data-type to main May 18, 2026 20:59

ghukill mentioned this pull request May 19, 2026

USE 558 - Write fulltext records to TIMDEX dataset MITLibraries/dspace-fulltext-harvester#11

Merged

Bump version to v5.1

7fe726f

Additionally, move data type related imports back to root namespace for backwards compatibility.

ghukill force-pushed the USE-531-fulltext-data-type-add branch from 7dd5850 to 7fe726f Compare May 19, 2026 18:12

ehanson8 self-assigned this May 20, 2026

ehanson8 approved these changes May 20, 2026

ghukill merged commit 450af3c into main May 20, 2026
2 checks passed

ghukill merged 2 commits into main from USE-531-fulltext-data-type-add
