USE 531 - Add "fulltexts" data type #189
Merged
Why these changes are being introduced:
The TIMDEX dataset originally held "records". Then we bolted on "embeddings" as another data type in the dataset, but with its own code paths. Then we refactored so "records" and "embeddings" had the same signatures and shared code. Now, we're ready for a new data type "fulltexts": full text, not just metadata, that is associated with a record. This data type will have very little metadata beyond associating with a specific TIMDEX record version.

How this addresses that need:
Adds new module `data_types/fulltexts.py` with the main class `TIMDEXFulltexts`. This class follows the plural naming conventions of "records" and "embeddings". The primary data column for this data type is "fulltext", the actual text payload. A metadata column `fulltext_md5` is also added to fingerprint the text and provide a way to avoid storing the fulltext again if not needed.

Side effects of this change:
* Arguably, none really. The TIMDEX dataset gets a new data type, but it's inconsequential until used.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-531
Additionally, move data type related imports back to root namespace for backwards compatibility.
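The `fulltext_md5` fingerprint described above could support skipping unchanged writes along these lines. This is only a minimal sketch: `fulltext_md5()` and `needs_write()` are hypothetical helpers written for illustration, not functions from `data_types/fulltexts.py`, and the checksum is not currently used this way.

```python
import hashlib
from typing import Optional


def fulltext_md5(fulltext: bytes) -> str:
    """Return the hex MD5 digest of the raw fulltext bytes (hypothetical helper)."""
    return hashlib.md5(fulltext).hexdigest()


def needs_write(fulltext: bytes, stored_md5: Optional[str]) -> bool:
    """Decide whether a fulltext payload should be written again.

    Assumption: `stored_md5` is the `fulltext_md5` value already held for the
    same TIMDEX record, or None if nothing has been stored yet.
    """
    if stored_md5 is None:
        return True
    # Write again only when the fingerprint differs from the stored one.
    return fulltext_md5(fulltext) != stored_md5


# Usage: the second call is skipped because the bytes are unchanged.
first = b"Full text of a thesis."
stored = fulltext_md5(first)
print(needs_write(first, None))
print(needs_write(first, stored))
```

Comparing a short digest avoids reading or rewriting the full payload when nothing has changed. MD5 is used here purely as a fingerprint, not for security.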
ehanson8 approved these changes on May 20, 2026
Works as expected and code looks good!
Purpose and background context
This PR adds a new "fulltexts" data type to the TIMDEX dataset, sitting alongside the pre-existing data types "records" and "embeddings".
Adding this data type leverages the new data type parity: it requires little more than a schema + metadata definition and relies on the code shared by all data types.
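As a rough illustration of what "a schema + metadata definition" can look like, here is a minimal sketch. `DataTypeSpec`, `BASE_COLUMNS`, and `filterable_columns()` are hypothetical names invented for this example and do not reflect the actual `TIMDEXFulltexts` implementation; only the column names `fulltext`, `fulltext_md5`, `source`, `run_date`, and `action` come from this PR.

```python
from dataclasses import dataclass

# Assumption: a subset of the "base" record metadata that every row carries,
# whatever its data type.
BASE_COLUMNS = ("source", "run_date", "action")


@dataclass(frozen=True)
class DataTypeSpec:
    """Hypothetical declarative description of one dataset data type."""

    name: str
    data_columns: tuple
    metadata_columns: tuple = ()

    def filterable_columns(self) -> tuple:
        # Base record metadata plus type-specific metadata are treated as
        # filterable; data (payload) columns are not.
        return BASE_COLUMNS + self.metadata_columns


# With shared code handling reads and writes, a new data type reduces to a
# declaration like this one.
FULLTEXTS = DataTypeSpec(
    name="fulltexts",
    data_columns=("fulltext",),
    metadata_columns=("fulltext_md5",),
)

print(FULLTEXTS.filterable_columns())
```

In this sketch, `fulltext_md5` is filterable while the `fulltext` payload is not, mirroring the metadata field vs. data field distinction in this PR.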
These "fulltext" rows capture the fulltext associated with a TIMDEX record. At this time, only DSpace theses will have fulltext records, but we may explore including LibGuides and MITLib websites fulltext here as well (currently it is added directly to the OpenSearch document by Transmogrifier, which makes it searchable but not available in the dataset for any additional uses). We may also explore fulltext for other sources, e.g. archival materials, Dome materials, etc.
Unlike embeddings, there is not much metadata about fulltexts at this time. We do include an MD5 checksum of the fulltext bytes to avoid re-harvesting / re-writing fulltext in the future, but it is not currently used (though the CLI app dspace-fulltext-harvester may use it). The primary data column is simply `fulltext`, with the fulltext of the record as bytes.
The first planned use of this new data type is this:
* dspace, run the dspace-fulltext-harvester as part of the TIMDEX ETL StepFunction
How can a reviewer manually see the effects of these changes?
1- Set Dev `TimdexManagersCredentials`
2- Set env vars:
3- Load dataset:
4- Write batch of `fulltexts` records, using an arbitrary `researchdatabases` ETL run as the source TIMDEX records:
5- Confirm our write worked and we can see records:
Note that the record returned here, like the records in step 6, contains additional metadata about the TIMDEX record, such as `[source, run_date, action, ...]`. This is a byproduct of the v5 refactoring, which makes all data types in the dataset see very similar "base" metadata for each row, specifically the record it's associated with.
6- Confirm that `fulltext_md5` is a queryable / filterable metadata field, not a data field:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Code review