HBASE-30061 Add EWMA-based BlockCompressedSizePredicator #8075

Open
apurtell wants to merge 1 commit into apache:master from apurtell:HBASE-30061

@apurtell
Contributor

PreviousBlockCompressionRatePredicator has three algorithmic deficiencies that cause compressed blocks to systematically undershoot the configured block size target: integer division truncation, single-sample estimation, and no smoothing of the estimated compression ratio.
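To illustrate the integer division truncation issue, here is a minimal sketch using figures from the results table (a 65536-byte uncompressed block that gzips to about 22354 bytes). The class and method names are hypothetical and this is not the actual PreviousBlockCompressionRatePredicator code; it only assumes that the predicator scales the target block size by an uncompressed/compressed ratio.

```java
// Hypothetical sketch, not HBase code: shows how an int ratio loses precision.
public class RatioTruncationSketch {

    // Integer division drops the fractional part of the compression ratio.
    public static int truncatedRatio(int uncompressedBytes, int compressedBytes) {
        return uncompressedBytes / compressedBytes;
    }

    // Double-precision division keeps the fractional part.
    public static double preciseRatio(int uncompressedBytes, int compressedBytes) {
        return (double) uncompressedBytes / compressedBytes;
    }

    // Uncompressed bytes to accumulate before closing a block, assuming the
    // compressed size should land near targetBlockSize.
    public static long adjustedSize(int targetBlockSize, double ratio) {
        return (long) (targetBlockSize * ratio);
    }

    public static void main(String[] args) {
        int target = 65536;
        int uncompressed = 65536;
        int compressed = 22354; // mean on-disk size of the gz/NONE baseline row

        int truncated = truncatedRatio(uncompressed, compressed);
        double precise = preciseRatio(uncompressed, compressed);

        System.out.println("int ratio    = " + truncated
            + ", adjusted size = " + adjustedSize(target, truncated));
        System.out.println("double ratio = " + precise
            + ", adjusted size = " + adjustedSize(target, precise));
    }
}
```

With the truncated ratio of 2, the writer would buffer 131072 uncompressed bytes, which at a true ratio near 2.93 compresses to roughly 44 KB. That is in line with the PrevBlock rows for gz in the table.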

EWMABlockSizePredicator addresses these issues by using double-precision arithmetic and by smoothing the estimated compression ratio with an exponentially weighted moving average (EWMA). This produces compressed HFile blocks that are closer to the configured target block size.

The ratio is smoothed with a default alpha of 0.5, which adapts quickly to changing data while dampening single-block variance. With alpha = 0.5, three blocks are enough for the EWMA to close 87.5% (1 - 0.5^3) of the gap to a new true ratio. Alpha = 0.5 is chosen because HFile blocks within a single file tend to have similar compression ratios (same column family, similar data distribution), and fast adaptation matters more than long-term smoothing because predicator state is per-file.
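The smoothing can be sketched as below. This is an illustrative stand-alone class, not the EWMABlockSizePredicator in this patch: the class name, the method names, and the choice to seed the estimate with the first sample are all assumptions made for the example.

```java
// Hypothetical sketch of EWMA smoothing of a compression ratio.
public class EwmaRatioSketch {
    private final double alpha;
    private double estimate;
    private boolean seeded;

    public EwmaRatioSketch(double alpha) {
        this.alpha = alpha;
    }

    // Fold one finished block into the running estimate.
    public void update(int uncompressedBytes, int compressedBytes) {
        double sample = (double) uncompressedBytes / compressedBytes;
        if (!seeded) {
            // Assumption for this sketch: the first sample seeds the estimate.
            estimate = sample;
            seeded = true;
        } else {
            estimate = alpha * sample + (1.0 - alpha) * estimate;
        }
    }

    public double estimate() {
        return estimate;
    }

    public static void main(String[] args) {
        EwmaRatioSketch ewma = new EwmaRatioSketch(0.5);
        ewma.update(100, 100);      // seed with ratio 1.0
        for (int i = 0; i < 3; i++) {
            ewma.update(300, 100);  // true ratio jumps to 3.0
        }
        // Estimate moves 1.0 -> 2.0 -> 2.5 -> 2.75
        double gapClosed = (ewma.estimate() - 1.0) / (3.0 - 1.0);
        System.out.println("estimate = " + ewma.estimate()
            + ", gap closed = " + gapClosed);
    }
}
```

After three blocks at the new ratio the estimate is 2.75, so 0.875 of the gap between the old ratio (1.0) and the new one (3.0) has been closed, matching the 87.5% figure above.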

Also adds HFileBlockPerformanceEvaluation, a microbenchmark for HFileBlock-related concerns.

==========================================================================================
  PREDICATOR ACCURACY RESULTS
==========================================================================================

Predicator     Compr  Encoding   BlkSize Blocks MeanOnDisk   Stddev        Min        Max    Dev%
-------------- ------ ---------- ------- ------ ---------- -------- ---------- ---------- -------
Uncompressed   none   NONE         65536   2907      66596      926      16681      66613    1.6%
PrevBlock      none   NONE         65536   2907      66596      926      16681      66613    1.6%
EWMA           none   NONE         65536   2907      66596      926      16681      66613    1.6%

Uncompressed   none   FAST_DIFF    65536   2907      64460      896      16171      64481    1.6%
PrevBlock      none   FAST_DIFF    65536   2819      66474      986      14159      66497    1.4%
EWMA           none   FAST_DIFF    65536   2819      66474      986      14159      66497    1.4%

Uncompressed   gz     NONE         65536   2996      22354        5      22338      22369   65.9%
PrevBlock      gz     NONE         65536   2987      44206      400      22350      44233   32.5%
EWMA           gz     NONE         65536   2954      65700     1399       3264      65758    0.2%

Uncompressed   gz     FAST_DIFF    65536   2996      22204        5      22190      22227   66.1%
PrevBlock      gz     FAST_DIFF    65536   2987      45257      422      22193      45289   30.9%
EWMA           gz     FAST_DIFF    65536   2938      65563      999      22202      65616    0.0%
