CLUBB update and clubb_intr improvements #1441
huebleruwm wants to merge 34 commits into `ESCOMP:cam_development`
Conversation
… clubb_grid_dir = 1, no BFB (as expected) with -1
@huebleruwm - After reading through this text, I moved this PR to draft. Once it is ready for the SEs to review and process it for bringing into CAM, please move it out of draft.
…grid, and flipping sections have been consolidated to directly around the time stepping loop. Next is to push them inward until the only clubb grid calculations are done inside advance_clubb_core.
…s descending BFB (except wp3_ta but that's internal and doesn't matter) - this is even true with clubb_cloudtop_cooling, clubb_rainevap_turb, and do_clubb_mf.
…ending mode, even though there is no notion of ascending in it. Perhaps there is some bug in the flipper? Otherwise it's an internal interaction between clubb and silhs somehow. It's very upsetting, but I think it's time to give up and come back to it someday.
…oning yet. All changes should be BFB, but a handful of field outputs are different above top_lev because I made it zero everything above top_lev. Among the fields that differ: some (RTM_CLUBB RTP2_CLUBB THLM_CLUBB UP2_CLUBB WP2_CLUBB) were being initialized to a tolerance above top_lev, and others (THLP2_CLUBB RTP2_CLUBB UM_CLUBB VM_CLUBB) were never initialized or set above top_lev, so we were outputting random garbage.
…ootprint and data copying steps in clubb_intr, mainly by switching pbuf variables (which are mostly clubb_inouts) to have nzm or nzt dimensions, allowing them to be passed into clubb directly and making the copying/flipping step unnecessary. This was tested with a top_lev > 1, so it should be well tested. There were some above-top_lev interactions found, which seem erroneous, so I've marked them with a TODO and a little explanation.
Force-pushed from `ed11ffc` to `b2b3232`
@adamrher I have done a number of ECT tests, and the results were pretty much as expected. I made a baseline from an ESCOMP/cam_development commit. To check the more "dangerous" changes to clubb_intr, I made another baseline from another commit. Other things I tested:
So I think the testing checks out.

Note: Also, I found a number of examples where clubb_intr is interacting above top_lev.
I ran some ECT tests with the performance options:
@huebleruwm I checked out this PR branch, and while the clubb_intr.F90 changes seem to be there, the .gitmodules file is pointing to the clubb external currently on cam_development. This should be pointing to a different tag / hash containing support for descending / ascending options, no?
Yes. In hindsight, it would've made much more sense to update that with the right hash each commit, rather than just including it in the commit comment. I've just been going to
I just pushed my last commit, so it's all yours now. It changes clubb_mf to operate on descending arrays, and I've confirmed it's b4b with the ascending version of the code. We can now get rid of the ascending code entirely. I'll leave that to you, along with making a new clubb tag and putting it in. I should be able to finish up the science validation by tomorrow -- I'll let you and Vince know.
Is the clubb update @HingOng's PR incorporating the non-traditional Coriolis terms? That would be great to include here.
Yes, that's the one. Just waiting on the nightly tests to make sure they all pass. It should all be BFB with the new flags set to false. |
Which ascending code is that? Is this ascending code you're referring to specific to CLUBB-MF or does it apply more generally to CLUBB? I thought we were going to leave in some ascending code just to allow testing of the ascending version of CLUBB within CAM. |
I forgot that we did agree to keep the ascending code option for CLUBB for debugging purposes. As for CLUBB-MF, the |
Setting
Ah, yeah, you're right. Bottom line is that I changed
I added the changes needed for the new clubb Coriolis code and made a tag that I updated the gitmodules with.
…tmodules with a new tag
Force-pushed from `9236e32` to `fbcaf32`
I did some more targeted timings with the options we're considering turning on by default. All speedups are measured against a baseline run with the current default settings.
The small speedup from the new
For this configuration, what speed-up do you see?
@vlarson just looking at one of the 2-year segments for each case, the mean
adamrher left a comment
@huebleruwm this is my first pass at a code review, making it only to the beginning of clubb_tend_cam. I had a few questions and clarifications, but no major issues. There's a lot of code cleanup and variable clarification, which is awesome.
I'm not sure how much time I'll have in the next two weeks to finish this code review, with AMWG coming up two weeks from today. I'll get to it as soon as I can.
adamrher left a comment
@huebleruwm I made some more progress on the code review, but I'll have to finish it up another day. I really like how you've added the _pbuf modifier to all the pbuf vars. It will make this code a lot easier to understand going forward!
adamrher left a comment
Publishing my morning review now so it doesn't get lost. Will pick up with a separate review this afternoon.
```fortran
! For consistency, set surface pressure to p_in_Pa at the first thermo.
! level above the surface (according to the CLUBB ascending grid).
p_sfc(i) = state_loc%pmid(i,pver)
```
If this is not the surface pressure (`state_loc%pint(i,pverp)`) then it needs to be named something else. Perusing advance_clubb_core.F90, it is consistently referred to as surface pressure in the comments, so I'm skeptical of setting this to `state_loc%pmid(i,pver)` (the midpoint pressure in the lowest model level).
After talking with Vince and checking it out in the code - p_sfc is the right name and it should be set to state_loc%pint. So this is a bug.
The only place it's used in the code is pdf_closure, see here, and is used when l_call_pdf_closure_twice = .true., which it is in cam. So this will change results.
I also did a test of changing this from %pmid to %pint, and doing so breaks the ECT test for ~40 variables, many of which report >5 sigma differences. So it seems like we'll need another science evaluation if we want to change this. I'll leave it as is for now and add a little TODO comment explaining the problem.
Darn. I'll bring this up in our SE mtg tomorrow and report back on next steps.
Just to clarify, p_sfc was not an argument to advance_clubb_core in the clubb tag currently on cam_development. Is it a bug on cam_development currently? Or did we introduce this bug in this PR?
Looks like the way this was handled before was by using p_in_Pa(i,1) in place of p_sfc(i).
- Before the ghost point removal we used: `p_in_Pa_zm(i,1) = p_in_Pa(i,1)`
- After the removal that became: `p_in_Pa_zm(i,1) = p_sfc(i)`
The comment in that section of pdf_closure where it's used says this:

```fortran
! The pressure at thermodynamic level k = 1 has been set to be the surface
! (or model lower boundary) pressure. Since the surface (or model lower
! boundary) is located at momentum level k = 1, the pressure there is
! p_sfc, which is p_in_Pa(1). Thus, p_in_Pa_zm(1) = p_in_Pa(1).
```
So internally, clubb was assuming that the ghost point in `p_in_Pa` contained the "true" (momentum) surface pressure, and that IS how it was being set in clubb_standalone, using

```fortran
p_in_Pa(1) = p_in_Pa_zm(1)
```

but cam was setting

```fortran
p_in_Pa(i,k+1) = state1%pmid(i,pver-k+1)
p_in_Pa(i,1)   = p_in_Pa(i,2)
```

and I think that should have been `p_in_Pa(i,1) = state1%pint(i,pver)`.
So this was a bug that's been in cam the whole time. I tracked it back 7 years to a commit named "CAM6 initial release" where it first appeared, and clubb has been making the same internal assumption that p_in_Pa(1) was the momentum surface pressure the whole time.
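To make the half-level nature of this concrete, here is a schematic Python sketch (toy pressures and a 4-level column, not the actual Fortran in clubb_intr) of how the ascending `p_in_Pa` array with its ghost point was filled, and what the fix changes:

```python
import numpy as np

# Toy CAM column, top-down ordering (index 0 = model top), pressures in Pa.
pver = 4
pmid = np.array([30000.0, 50000.0, 70000.0, 90000.0])            # midpoint pressures
pint = np.array([20000.0, 40000.0, 60000.0, 80000.0, 100000.0])  # interface pressures

# Older CLUBB tags used an ascending grid with a ghost thermodynamic level at
# index 0 that is assumed to hold the surface (momentum-level) pressure.
p_in_Pa = np.empty(pver + 1)
for k in range(pver):            # mirrors p_in_Pa(i,k+1) = state1%pmid(i,pver-k+1)
    p_in_Pa[k + 1] = pmid[pver - 1 - k]

p_in_Pa_buggy = p_in_Pa.copy()
p_in_Pa_buggy[0] = p_in_Pa_buggy[1]   # old CAM: copies the lowest *midpoint* pressure
p_in_Pa_fixed = p_in_Pa.copy()
p_in_Pa_fixed[0] = pint[pver]         # fix: the true surface (interface) pressure

print(p_in_Pa_buggy[0], p_in_Pa_fixed[0])   # 90000.0 vs 100000.0
```

The two ghost-point values differ by half a level in this toy column, which is exactly the difference pdf_closure sees when `l_call_pdf_closure_twice = .true.`.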
I'm performing a science eval on this bug fix. I'll provide an update here ASAP.
adamrher left a comment
@huebleruwm I've finished my review of your PR. I have several comments that should be addressed. I do think you found a bug in our code for top_lev > 1 in the energy fixer and the applied tendencies, so thanks for catching that!
```fortran
      end do
   end do
end if
```

```fortran
!---- TODO: there seems to be an above top_lev interaction here that changes answers.
```
I agree with @huebleruwm that this is a bug in the code that would lead to spurious ptend%s tendencies above top_lev. While top_lev determines the top of the moist physics, and therefore where clouds can form, clouds may be advected to higher levels by the dycore. I endorse @huebleruwm's fixes (which are currently commented out to give BFB with the buggy code).
[edit -- april 9 -- I have committed this fix in b740f05c8a76e0985283057760f3a33a5a036d68]
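For what it's worth, the shape of the fix can be sketched in Python (illustrative only; `ptend_s`, `top_lev`, and the dimensions are stand-ins for the Fortran in clubb_intr): tendencies returned by the scheme are explicitly zeroed above top_lev so that levels the moist physics never touches cannot receive spurious increments:

```python
import numpy as np

ncol, pver = 3, 8
top_lev = 3   # first (1-based) level where CLUBB/moist physics is active

# Stand-in for a ptend%s-like tendency array filled by the scheme.
ptend_s = np.random.default_rng(1).normal(size=(ncol, pver))

# Zero everything above the moist-physics top: 1-based levels 1..top_lev-1,
# i.e. 0-based column indices 0..top_lev-2.
ptend_s[:, : top_lev - 1] = 0.0
```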
@adamrher I think I've addressed all the PR comments now and made the changes over the last 2 commits. They were all pretty small changes: moving things around, renaming stuff, comment tweaks, etc. I ran the usual ECT tests on CPU and GPU and both are passing. Unless I've missed something, all that's left to do is deal with the
```fortran
! TODO: p_sfc should be set to state_loc%pint, not state_loc%pmid.
! Changing it will change answers when clubb_l_call_pdf_closure_twice = .true.
```
Still one TODO here. Please remove or modify the comment.
This is the p_sfc issue that Adam is looking into right now, #1441 (comment) - I just figured I'd mark it to make clear that we're not 100% done yet.

There are two parts here: getting a new version of clubb in, and enabling the descending grid mode in clubb_intr. Both goals were split up over many commits. Fixes #1411
New CLUBB
The first commits are dedicated to getting a new version of CLUBB in. Because clubb_intr diverged between cam_development (which had significant changes to enable GPUization) and UWM's branch (which had redimensioning changes), the merging was done manually.

The first 6 commits here include changes that can be matched with a certain version of CLUBB:

- `3d40d8c0e03a298ae3925564bc4db2f0df5e9437` works with clubb hash `1632cf12e67fc4719fa21f8e449b024a0e3b6df2`
- `d0a7f8cb` works with clubb hash `673beb05187d7b536c2f36968fc7f5e1b9d1167e`
- `430ad03f` works with clubb hash `dc302b95e4b71220b33aeaddb0afc68c9103555e`
- `dccb59eb` works with clubb hash `dc302b95`
- `703aca60ed1e0b6b24f2cd890c3a4497041d25b8` works with clubb hash `d5957b30`
- `4d9b1b8a528ca532d964c1799e1860e96e068a12` works with clubb hash `d5957b30`

(To use the clubb hash, go to `src/physics/clubb` and run the git checkout there.)

These commits all have to do with just getting a new version of clubb in, so we need to ensure that at least `4d9b1b8a528ca532d964c1799e1860e96e068a12` is working correctly. The later commits have more complicated changes to clubb_intr, so if we find any problems with this branch, we should test commit `4d9b1b8a528ca532d964c1799e1860e96e068a12` as a next step.

clubb_intr improvements
The next commits after getting new clubb in are increasingly ambitious changes aimed at simplifying clubb_intr and reducing its runtime cost.

- `e60848b4ec4df90a3060ffd7f664fab42e847509` introduces a parameter, `clubb_grid_dir`, in clubb_intr that controls which direction the grid is when calling advance_clubb_core. When using -O0 to compile, and setting `l_test_grid_generalization = .true.` in `src/clubb/src/CLUBB_core/model_flag.F90`, the results are BFB in either grid direction.
- `dddff494966bf2bf4341fa4a7526b2f8b0f3d16e` separates the flipping code from the copying code (copying cam-sized arrays to clubb-sized arrays). Should all be BFB.
- `055e53f70741531a58d0f7da788b824c76fef087` pushes the flipping code inward until it's directly around the call to advance_clubb_core, and the way of controlling the grid direction has been changed to a flag, `l_ascending_grid`. This should all be BFB as well, and I tested with `clubb_cloudtop_cooling`, `clubb_rainevap_turb`, and `do_clubb_mf` all true to ensure BFBness in either ascending or descending mode. One caveat: the clubb_mf code assumes an ascending grid, so before calling it we flip the arrays, then flip the outputs back to descending.
- `2fb2ba9bd6b1ec5e5c60039f90d0c1020663d0c9` is a pretty safe intermediate commit, mainly moving stuff around in preparation for redimensioning.
- `bded8a561131e4dbbccad293f14226e5e8c0e856` is some very safe redimensioning.
- `fce8e1b1b5e8d3e93232c2129be7671fae79db23` is the big one that redimensions most pbuf arrays and uses them instead of the local clubb-sized ones. This allows us to avoid the vast majority of the data copying, and delete the local versions of the clubb arrays.

The rest are mainly safe and easy things, or commits to make GPU code work.
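The flipping described above amounts to reversing the vertical index between CAM's top-down arrays and CLUBB's ascending grid. A minimal Python sketch (not the Fortran in clubb_intr) of the idea and its exact round-trip property:

```python
import numpy as np

def flip_vertical(a):
    """Reverse the vertical (last) axis: CAM top-down <-> CLUBB ascending."""
    return a[..., ::-1]

# Toy state array with shape (ncol, nlev).
state = np.random.default_rng(0).normal(size=(4, 32))

# Flipping is its own inverse, and a flip in followed by a flip out is
# bit-for-bit: no floating-point arithmetic is performed, only reindexing.
round_trip = flip_vertical(flip_vertical(state))
assert np.array_equal(round_trip, state)   # BFB, not merely allclose
```

Because flipping only permutes memory, it cannot change answers by itself; a BFB failure in one grid direction points at logic that implicitly assumes a direction, as with the clubb_mf caveat above.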
Testing
I plan to run an ECT test to compare the cam_development head I started with to the head of this branch. Answers are expected to change slightly due to the ghost point removal in clubb, so I think it unlikely that this passes, but if it does that would be fantastic, and it might be the only testing we really need.

If that first ECT test doesn't pass, then I'll assume that the difference is due to the ghost point removal (or other small bit-changing commits in clubb over the past year), and rely on @adamrher to check if the results are good.
The biggest concern is whether the answer changes are acceptable; the only real differences expected are from the new version of CLUBB. If the answers from this branch look bad, we should go back to commit `4d9b1b8a528ca532d964c1799e1860e96e068a12` and check the answers from that, since it only includes the new version of CLUBB, and NO unnecessary clubb_intr changes. If the answers still look bad in that commit, then we have a few more we can step back through to try to figure out which commit introduces the differences. If the hypothetical problem is present in the first commit (`3d40d8c0e03a298ae3925564bc4db2f0df5e9437`), then the problem is harder, because that includes (pretty much only) the removal of the ghost point in clubb, which is expected to change answers, but hopefully not significantly.

Again, if that first ECT test fails, then I can still run another ECT test between the version where clubb is up to date (`4d9b1b8a528ca532d964c1799e1860e96e068a12`) and the head of this branch. The changes between that commit and the head may be slightly bit-changing (without -O0), but definitely shouldn't be answer-changing. If this ECT test fails, then I've made a mistake in the changes meant to avoid the flipping/copying, and I'll have to step back through and figure out what bad thing I did.

Some other tests I've run along the way help confirm at least some of these changes:
- `e60848b4ec4df90a3060ffd7f664fab42e847509`
- `4d9b1b8a528ca532d964c1799e1860e96e068a12` should be working on the GPU, but later commits are untested

Next steps
I left a number of things messy for now, such as comments and gptl timers. Once we confirm these changes, I'd like to go through and make some of those nicer as a final step.
Performance
In addition to the clubb_intr changes improving performance, we should use this as an opportunity to try out flags that should be significantly faster:
- `clubb_penta_solve_method = 2` uses our custom pentadiagonal matrix solvers, which should be significantly faster than lapack and should pass an ECT test
- `clubb_tridiag_solve_method = 2` uses our custom tridiagonal solver, which should also be faster and pass an ECT test
- `clubb_fill_holes_type = 4` uses a different hole-filling algorithm that in theory is just all-around better and faster than our current one; I suspect it will pass ECT, but I have yet to test it
- `clubb_l_call_pdf_closure_twice = .false.` will avoid a pdf_closure call (which is significant in cost) and reduce the memory footprint. I think it will have minimal effect (based on my visual analysis of what it affects in the code), but it is the most likely to break the ECT test
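The speedup from the custom banded solvers comes from exploiting structure: a tridiagonal system can be solved in O(n) with forward elimination and back substitution instead of a general-purpose factorization. A Python sketch of that idea (the Thomas algorithm; the actual CLUBB solvers are Fortran and are not reproduced here):

```python
import numpy as np

def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(n) with the Thomas algorithm.

    lower[i] multiplies x[i-1], diag[i] multiplies x[i], upper[i]
    multiplies x[i+1]; lower[0] and upper[-1] are unused. This is the
    kind of specialization a dedicated banded solver relies on, versus
    a general (LAPACK-style) dense solve.
    """
    n = len(diag)
    c = np.empty(n)
    d = np.empty(n)
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                      # forward elimination
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / m
    x = np.empty(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Check against a dense solve on a small diffusion-like system.
n = 6
lower = np.full(n, -1.0)
upper = np.full(n, -1.0)
diag = np.full(n, 2.5)
rhs = np.arange(1.0, n + 1.0)
A = np.diag(diag) + np.diag(lower[1:], -1) + np.diag(upper[:-1], 1)
assert np.allclose(thomas_solve(lower, diag, upper, rhs), np.linalg.solve(A, rhs))
```

A pentadiagonal solver follows the same pattern with two sub- and super-diagonals; either way the win over a generic solve grows with the number of vertical levels.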