On Wed, Dec 13, 2017 at 6:07 AM, Martin Vignali <martin.vignali at gmail.com> wrote:
> + vpermq m1, [srcq + xq - mmsize + %3], 0x4e; flip each lane at load
> + vpermq m2, [srcq + xq - 2 * mmsize + %3], 0x4e; flip each lane at load

Would doing 2x 128-bit movu + 2x vinserti128 be faster?