Skip to content

[SYSTEMDS-3948] Row-wise Sparsity Estimator#2466

Open
ywcb00 wants to merge 15 commits into
apache:mainfrom
ywcb00:feat/sparsity/estim/rs
Open

[SYSTEMDS-3948] Row-wise Sparsity Estimator#2466
ywcb00 wants to merge 15 commits into
apache:mainfrom
ywcb00:feat/sparsity/estim/rs

Conversation

@ywcb00
Copy link
Copy Markdown
Contributor

@ywcb00 ywcb00 commented May 6, 2026

Hi,
this PR adds the row-wise sparsity estimator from:
Lin, Chunxu, Wensheng Luo, Yixiang Fang, Chenhao Ma, Xilin Liu and Yuchi Ma;
On Efficient Large Sparse Matrix Chain Multiplication;
Proceedings of the ACM on Management of Data 2 (2024): 1 - 27.

Note that the row sparsity propagation, as described in the publication, applies to MM chains only. Other operations use fallback methods for sparsity estimation w/ row sparsity vectors, which create a cut in the sparsity estimation DAG.

Copy link
Copy Markdown
Contributor

@Baunsgaard Baunsgaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. hope it helps.

}


// Row Wise Sparsity Estimator
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can optionally have all of these classes extend another class where you add these tests. Instead of copying the code on all of them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment. I do not fully understand how you would consolidate them. Could you please clarify?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I ment was something like :

OpElemWTest extends OpElemBaseTest {

}

and then you define a new OpElemBaseTest class that contains the variables all the different OpElem test cases share, and define in that base test the fundamental tests that all of the OpElem should run.

I hope this would work nicely, since it makes it easy to make the same test flavour run in all of the variations.

Copy link
Copy Markdown
Contributor Author

@ywcb00 ywcb00 May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again for the explanation, I gave it another thought. While I also like such a hierarchical structure for tests in general, I cannot see the clear advantage here--given that I understood your suggestion correctly.
There are five common variables among the test cases m, n, sparsity, mult, plus, which would then be defined inside OPElemBaseTest class instead of OPElemWTest class. I would not consolidate tests for multiplication and addition (neither for different estimators) in one single function, as this would break the typical structure in which tests are defined (right?). Hence, I think such a base class would just add another class file, or have I misunderstood something?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think the restructure would be advantageous. Since many of these tests have the same fundamental structure, and are not very well tested with only one matrix size. Alternatively you could change the entire class to be a parameterized test instead. That would also help you make more tests easily.

e.g.


@RunWith(value = Parameterized.class)
public class OpElm {
  public OpElm(Estimator e, int m, int n, double s, BinaryOp BOp){
   ...
  }


  @Parameters
  public static Collection<Object[]> data() {
  ...


  }


}

See for instance src/test/java/org/apache/sysds/test/component/compress/CompressedTestBase.java.

To make the tests faster as well you can set as input the actual matrix, and reuse the same matrix across many tests. The current structure create the matrix for each test, and that is by far the most expensive part of these tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patience. I really like the suggestion of parameterized tests, and changed it accordingly. :)

I also tried to reuse the input matrices instead of generating them from scratch every time. However, as it did not result in a speedup of the tests, I removed it again.

Comment thread src/test/java/org/apache/sysds/test/component/estim/OpSingleTest.java Outdated
Comment thread src/test/java/org/apache/sysds/test/component/estim/OpSingleTest.java Outdated
Comment thread src/main/java/org/apache/sysds/hops/estim/EstimatorRowWise.java Outdated
Comment thread src/main/java/org/apache/sysds/hops/estim/EstimatorRowWise.java Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 76.28205% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.34%. Comparing base (c2d0e5a) to head (4855b08).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
.../org/apache/sysds/hops/estim/EstimatorRowWise.java 76.28% 24 Missing and 13 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2466      +/-   ##
============================================
+ Coverage     71.29%   71.34%   +0.04%     
- Complexity    48574    48686     +112     
============================================
  Files          1567     1569       +2     
  Lines        188401   188776     +375     
  Branches      36980    37040      +60     
============================================
+ Hits         134325   134673     +348     
+ Misses        43630    43628       -2     
- Partials      10446    10475      +29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ywcb00 ywcb00 force-pushed the feat/sparsity/estim/rs branch from 30b1f54 to 2361593 Compare May 11, 2026 08:47
ywcb00 added 13 commits May 15, 2026 15:38
…ise sparsity estimator

	works for the matrix multiplication and bind test cases for now
…ontainer for row wise sparsity vectors to simplify access and allow storing it with chain nodes
…mator with element-wise and single operations
…wise and single operations

	NOTE: using average case estimation per row
…ro and diag (mv and vm) operations with the row-wise sparsity estimator
… to consolidate all calls to getters before the switch
… for row-wise sparsity vector and apply the corresponding operations directly in the code instead
@ywcb00 ywcb00 force-pushed the feat/sparsity/estim/rs branch from 1f2c869 to 30cc518 Compare May 15, 2026 13:39
@ywcb00
Copy link
Copy Markdown
Contributor Author

ywcb00 commented May 15, 2026

Thank you for the comments @Baunsgaard . I have updated the PR accordingly. :)

Copy link
Copy Markdown
Contributor

@Baunsgaard Baunsgaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments

Comment thread src/main/java/org/apache/sysds/hops/estim/EstimatorRowWise.java Outdated
Comment thread src/main/java/org/apache/sysds/hops/estim/EstimatorRowWise.java Outdated
Comment thread src/main/java/org/apache/sysds/hops/estim/EstimatorRowWise.java Outdated
Comment thread src/main/java/org/apache/sysds/hops/estim/EstimatorRowWise.java Outdated
Comment thread src/main/java/org/apache/sysds/hops/estim/EstimatorRowWise.java Outdated
Comment thread src/main/java/org/apache/sysds/hops/estim/EstimatorRowWise.java Outdated
@ywcb00
Copy link
Copy Markdown
Contributor Author

ywcb00 commented May 19, 2026

Thank you for the valuable comments. :)

Copy link
Copy Markdown
Contributor

@Baunsgaard Baunsgaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR LGTM, there are only micro optimizations to do now.

Comment on lines +221 to +222
* NOTE: this is the best estimation possible when we only have the two row sparsity vectors
* NOTE: Considering the average of the second matrix would probably not be far off while saving computing time
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a wrong statement.

The estimator here is the uniform estimator ( also called average-case or in fancy terms Naive Bayes estimator )

It is not the 'best', if that even exists, and it is an interesting point to consider if you are going in this direction for your research.

a clear way to see it is not best or worst case is:

rsM1 = [1,0,0] and rsM2 = [1/3,1/3,1/3]

then:

rsOut[0] = 1 - (1 - 1·1/3)^3
         = 1 - (2/3)^3
         = 1 - 8/27
         = 19/27 ≈ 0.704

which is impossible to get in practice.

for these values the matrices could be :

M1 = [1 1 1]      M2 = [1 0 0]
     [0 0 0]           [0 1 0]
     [0 0 0]           [0 0 1]

C = M1·M2 = [1 1 1]   →  row-0 actual density = 1.0
            [0 0 0]
            [0 0 0]

formular says 0.704 while the matrix actually is 1.0

another fun one is :

M1 = [1 1 1]      M2 = [1 0 0]
     [0 0 0]           [1 0 0]
     [0 0 0]           [1 0 0]

C = M1·M2 = [3 0 0]   →  row-0 actual density = 1/3 ≈ 0.333
            [0 0 0]
            [0 0 0]

it ends up as 0.333 density, but the estimator says 0.7 again.

for(int i = 0; i < rsM1.length; i++) {
double currentVal = 1;
for(int j = 0; j < rsM2.length; j++) {
currentVal *= (double) 1 - (rsM1[i] * rsM2[j]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider on code like this to do :

currentVal *= 1.0 - (rsM1[i] * rsM2[j]);

you do not need to write the cast.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also consider early out, if you encounter zero, do not enter the inner loop

rsM1[i] == 0;

* NOTE: this is the best estimation possible when we only have the two row sparsity vectors
* NOTE: Considering the average of the second matrix would probably not be far off while saving computing time
*/
private double[] estimInternMMFallback(double[] rsM1, double[] rsM2) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to the cbind this depends on number of rows/cols in the input, right ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants