Add a fast integer divide that rounds to zero #6455
Expr xsign = select(numerator > 0, cast(t, 0), cast(t, -1));

// Multiply-keep-high-half
result = (cast(wide, mul) * numerator);
I think this should use widening_mul intrinsics, because uses of this are after find_intrinsics. Maybe this whole sequence should be mul_shift_right.
Actually this code is only called directly by users, so it's before find_intrinsics. The compiler doesn't ever call this.
Maybe add this as a comment for future readers.
I actually think we should change it to intrinsics anyways. But since the code is just moved and pre-existing, maybe it should be a separate PR.
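For readers unfamiliar with the "multiply-keep-high-half" step discussed in this thread, here is a minimal scalar sketch in plain C++. It is not the PR's Halide code: the helper names `mul_keep_high` and `div3_round_to_zero` are hypothetical, and the divisor 3 with multiplier `0x55555556` is just a textbook illustration. The idea is to widen, multiply by a precomputed constant, keep the high half of the product, then add the sign bit so that negative numerators round toward zero.

```cpp
#include <cstdint>

// Hypothetical helper (not from the PR): widen both operands, multiply,
// and keep only the high 32 bits of the 64-bit product.
inline int32_t mul_keep_high(int32_t a, int32_t b) {
    int64_t wide = static_cast<int64_t>(a) * static_cast<int64_t>(b);
    // Right-shifting a negative value is an arithmetic shift on mainstream
    // compilers (and is guaranteed to be one since C++20).
    return static_cast<int32_t>(wide >> 32);
}

// Illustrative only: divide by the constant 3, rounding toward zero.
inline int32_t div3_round_to_zero(int32_t n) {
    const int32_t magic = 0x55555556;  // ceil(2^32 / 3)
    int32_t q = mul_keep_high(magic, n);
    // For negative n the high half is one below the truncated quotient,
    // so adding the sign bit (1 if n < 0, else 0) corrects it.
    q += static_cast<int32_t>(static_cast<uint32_t>(n) >> 31);
    return q;
}
```

In Halide the widening multiply and shift would be expressed with vector intrinsics rather than scalar casts, which is the point of the suggestion above; this sketch only shows the arithmetic.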
// Reference good version
g(x, y) = input(x, y) / cast<T>(y + min_val);
// Reference good version
This looks identical to the case just above, are they supposed to be identical?
Yes, they have different schedules which turn the denominator into a constant in one case but not the other.
bool srz_method_0(int den, int sh_post, int bits) {
int64_t min = -(1L << (bits - 1)), max = (1L << (bits - 1)) - 1;
for (int64_t num = min; num <= max; num++) {
// for (int iter = 0; iter < 1000000L; iter++) {
Why is this commented out? If it's being left in for (e.g.) debugging purposes, please say so.
See also related issue #6456
review ping
Looks like there's a bug in the handling of constant denominators (an early-out path that assumes we're rounding to -infinity). Will fix.
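To make the distinction behind this bug concrete, here is a small plain C++ sketch of the two rounding conventions. The helper names `trunc_div` and `floor_div` are hypothetical and are not taken from the PR. The two results differ only when the exact quotient is negative and not an integer, which is why a path that assumes rounding to -infinity can give wrong answers for a round-to-zero divide.

```cpp
// Hypothetical helpers, for illustration only.

// Round toward zero: this is what C++'s built-in integer division does.
inline int trunc_div(int a, int b) {
    return a / b;
}

// Round toward negative infinity (floor division).
inline int floor_div(int a, int b) {
    int q = a / b;
    // Step down by one when there is a remainder and the signs differ,
    // i.e. when the exact quotient is negative and not an integer.
    if (a % b != 0 && ((a < 0) != (b < 0))) {
        q -= 1;
    }
    return q;
}
```

For example, with a numerator of -7 and a denominator of 2, `trunc_div` gives -3 while `floor_div` gives -4; for 7 and 2 both give 3.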
See #7008
While working on legacy code I discovered a need for this. Performance test shows a good speed-up over native division for vector code: