Thanks for the feedback. I have updated the patch. It is now ~7x faster than the C implementation.

Changes:
- Do not use the v8-v15 registers.
- Use the urhadd instruction.
- Reorder the instructions to improve performance.

// Hubert