Could we do better? Assuredly. There are many AVX-512 instructions that we are not using yet. We do not use the ternary Boolean operations (vpternlogd/vpternlogq). We are not using the powerful new byte-shuffle instructions (e.g., vpermt2b). We have an example of coevolution: better hardware requires new software which, in turn, makes the hardware shine.
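To make the ternary Boolean operation concrete, here is a minimal scalar sketch in C of what one 32-bit lane of vpternlogd computes. It is an illustration, not code from the post: the names `ternlog32` and `ternlog32_examples` are hypothetical, and the loop models the semantics rather than the performance. Each result bit is looked up in the 8-bit immediate, treated as a truth table indexed by the corresponding bits of the three inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of vpternlogd (hypothetical helper).
 * For every bit position, the three input bits form an index
 * (a is the high bit, c is the low bit) into the truth table imm8. */
uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8) {
    uint32_t result = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        result |= (uint32_t)((imm8 >> idx) & 1u) << bit;
    }
    return result;
}

/* A few truth tables checked against their plain Boolean formulas. */
void ternlog32_examples(void) {
    uint32_t a = 0x0F0F1234u, b = 0x33CC5678u, c = 0x55AA9ABCu;
    assert(ternlog32(a, b, c, 0x96) == (a ^ b ^ c));                   /* three-way XOR */
    assert(ternlog32(a, b, c, 0xCA) == ((a & b) | (~a & c)));          /* bitwise select: a ? b : c */
    assert(ternlog32(a, b, c, 0xE8) == ((a & b) | (a & c) | (b & c))); /* majority */
}
```

On AVX-512 hardware the same truth table would be passed as the last argument of the `_mm512_ternarylogic_epi32` intrinsic from `<immintrin.h>`, which applies it to all sixteen 32-bit lanes in a single instruction. A handy way to derive the immediate for a formula is to evaluate it with a = 0xF0, b = 0xCC, c = 0xAA: the resulting byte is the truth table.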
Running AVX-512 can push power draw up to 170 W. It really is a garbage instruction set.