I’ve mentioned Kaze Emanuar’s efforts to make the best Mario 64 there can possibly be on its native hardware. He’s compiled it with optimization flags turned on, made its platforming engine much more efficient, and worked hard to minimize cache misses, which was a major source of slowdowns in the game’s code. Under his efforts, he’s gotten the engine running at 60fps (although not yet in a playable version of the original). While these optimizations are not the kind of thing that can keep being found indefinitely, he’s bound to run out of ways to tune up the code, currently he’s still finding new ways to speed it up.
He made a Youtube video detailing his most recent optimization find: getting the game’s trigonometric functions executing at their speediest. What is interesting is that the Mario 64 code already uses a couple of tricks to get sine and cosine results in a rapid manner: the game only uses 4096 discrete angles of movement direction, and contains a lookup table that covers each of those angles. But it turns out that this optimization is actually a mis-optimization, because the RAM bus hits incurred to read the values into the cache are actually more expensive than just figuring out the values in code on the N64’s hardware!
The video starts out decently comprehensible, but eventually descends into the process of figuring out sine and cosine on the fly, and the virtues of the various ways this can be done, so you can’t be faulted for bailing before the end, possibly at the moment the dreaded words “Taylor series” are mentioned. But it’s a fairly interesting watch until then!