"M" Performance... Pessimizations! or: SSE is a Scam.


This article is a continuation of the M series.

The vpatch given below entirely re-implements the TLB (MMU) of M to use SIMD instructions from the AMD64 SSE2 set.

Whereas previously TLB entries were kept in memory and searched iteratively, now we keep the Tags (3 bytes each) sliced across three XMM registers, one register per Tag byte, and search them in parallel, e.g.:

        %define TLB_TAG_BYTE_0       xmm5  ; Byte 0 of Tag
        %define TLB_TAG_BYTE_1       xmm6  ; Byte 1 of Tag
        %define TLB_TAG_BYTE_2       xmm7  ; Byte 2 of Tag
        %define XMM_T0               xmm8  ; Temp
 
        ;; .....
 
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        ;; Tag being sought is in ecx;
        ;; Stored Tag slices are in TLB_TAG_BYTE_0 .. TLB_TAG_BYTE_2.
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        ;; Search for B0, B1, B2 of Tag, accumulate result in ebx ;;
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        ; Search for Byte 0 of Tag:
        mov       edx,    ecx           ; edx := ecx (wanted Tag)
        and       edx,    0xFF          ; Byte 0 (lowest) of wanted Tag
        ; Fill T0 with 16 copies of Tag Byte 0:
        movd      XMM_T0, edx
        punpcklbw XMM_T0, XMM_T0
        punpcklwd XMM_T0, XMM_T0
        pshufd    XMM_T0, XMM_T0, 0
        ; Now SIMD-compare:
        pcmpeqb   XMM_T0, TLB_TAG_BYTE_0
        ; Get the result mask of the compare:
        pmovmskb  ebx,    XMM_T0         ; i-th bit in ebx = 1 where match B0
        test      ebx,    ebx            ; if Byte 0 of Tag not found:
        jz        .Not_Found             ; ... then go straight to 'not found'
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        ; Search for Byte 1 of Tag:
        mov       edx,    ecx              ; edx := ecx (wanted Tag)
        shr       edx,    8                ; Byte 1 (middle) of wanted Tag
        and       edx,    0xFF
        ; Fill T0 with 16 copies of Tag Byte 1:
        movd      XMM_T0, edx
        punpcklbw XMM_T0, XMM_T0
        punpcklwd XMM_T0, XMM_T0
        pshufd    XMM_T0, XMM_T0, 0
        ; Now SIMD-compare:
        pcmpeqb   XMM_T0, TLB_TAG_BYTE_1
        ; Get the result mask of the compare:
        pmovmskb  edx,    XMM_T0           ; i-th bit in edx = 1 where match B1
        and       ebx,    edx              ; Keep only where B0 also matched
        test      ebx,    ebx              ; if Bytes 0+1 of Tag not found:
        jz        .Not_Found               ; ... then go straight to 'not found'
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        ; Search for Byte 2 of Tag:
        mov       edx,    ecx              ; edx := ecx (wanted Tag)
        shr       edx,    16               ; Byte 2 (top) of wanted Tag
        and       edx,    0xFF
        ; Fill T0 with 16 copies of Tag Byte 2:
        movd      XMM_T0, edx
        punpcklbw XMM_T0, XMM_T0
        punpcklwd XMM_T0, XMM_T0
        pshufd    XMM_T0, XMM_T0, 0
        ; Now SIMD-compare:
        pcmpeqb   XMM_T0, TLB_TAG_BYTE_2
        ; Get the result mask of the compare:
        pmovmskb  edx,    XMM_T0           ; i-th bit in edx = 1 where match B2
        and       ebx,    edx              ; Keep only where B0,B1 also matched
        test      ebx,    ebx              ; if Bytes 0+1+2 of Tag not found:
        jz        .Not_Found               ; ... then go straight to 'not found'
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
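
For readers who do not have the instruction semantics memorized, here is a minimal scalar sketch in C of what the three compare steps above compute. It is a model, not SIMD code and not part of the vpatch: the names tlb_tags, match_mask, tlb_store and tlb_find are hypothetical, and the final "lowest set bit" step is an assumption, since the excerpt stops before the hit path.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 16  /* one byte lane per entry in a 128-bit XMM register */

/* Hypothetical model of the three Tag-slice registers (xmm5..xmm7). */
typedef struct {
    uint8_t b0[TLB_ENTRIES];  /* Byte 0 (lowest) of each stored Tag */
    uint8_t b1[TLB_ENTRIES];  /* Byte 1 (middle) of each stored Tag */
    uint8_t b2[TLB_ENTRIES];  /* Byte 2 (top) of each stored Tag    */
} tlb_tags;

/* Store a 24-bit Tag into entry i, one byte per slice. */
static void tlb_store(tlb_tags *t, int i, uint32_t tag)
{
    assert(i >= 0 && i < TLB_ENTRIES);
    t->b0[i] = (uint8_t)(tag & 0xFF);
    t->b1[i] = (uint8_t)((tag >> 8) & 0xFF);
    t->b2[i] = (uint8_t)((tag >> 16) & 0xFF);
}

/* Models pcmpeqb + pmovmskb: bit i is set where lane i equals 'wanted'. */
static unsigned match_mask(const uint8_t lanes[TLB_ENTRIES], uint8_t wanted)
{
    unsigned mask = 0;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (lanes[i] == wanted)
            mask |= 1u << i;
    return mask;
}

/* Returns the lowest entry index whose 24-bit Tag equals 'tag', or -1. */
static int tlb_find(const tlb_tags *t, uint32_t tag)
{
    unsigned m = match_mask(t->b0, (uint8_t)(tag & 0xFF));
    if (m == 0) return -1;                      /* Byte 0 not found        */
    m &= match_mask(t->b1, (uint8_t)((tag >> 8) & 0xFF));
    if (m == 0) return -1;                      /* Bytes 0+1 not found     */
    m &= match_mask(t->b2, (uint8_t)((tag >> 16) & 0xFF));
    if (m == 0) return -1;                      /* Bytes 0+1+2 not found   */
    int i = 0;
    while (!(m & 1u)) { m >>= 1; i++; }         /* assumed: take lowest hit */
    return i;
}
```

Note that the AND of the three masks is what makes the search correct: a lane only survives if all three of its bytes match, so a Tag whose bytes are scattered across different entries is rejected.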

Surprisingly, this was found to slow down the execution of M by approximately 30% (as measured by the Dhrystone benchmark).

Before anyone comments: I am aware that the punpcklbw/punpcklwd/pshufd sequence can be replaced with one instruction (pshufb with a zeroed shuffle mask) on processors that support SSSE3. However, none of the machines on which I was considering deploying this program supports it. Or ever will.
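
As a rough illustration of why the SSE2 broadcast takes three steps after the movd, the following C sketch models each instruction on a 16-byte array. It is a scalar model only; the xmm type and the function names are hypothetical and merely mirror the mnemonics in the listing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit XMM register as 16 bytes. */
typedef struct { uint8_t b[16]; } xmm;

/* movd xmm, r32: low 4 bytes take r (little-endian), upper 12 are zeroed. */
static xmm movd(uint32_t r)
{
    xmm x;
    memset(x.b, 0, sizeof x.b);
    for (int i = 0; i < 4; i++)
        x.b[i] = (uint8_t)(r >> (8 * i));
    return x;
}

/* punpcklbw x, x: each of the low 8 bytes is duplicated into a byte pair. */
static xmm punpcklbw_self(xmm x)
{
    xmm r;
    for (int i = 0; i < 8; i++) {
        r.b[2 * i]     = x.b[i];
        r.b[2 * i + 1] = x.b[i];
    }
    return r;
}

/* punpcklwd x, x: each of the low 4 words is duplicated into a word pair. */
static xmm punpcklwd_self(xmm x)
{
    xmm r;
    for (int i = 0; i < 4; i++) {
        memcpy(&r.b[4 * i],     &x.b[2 * i], 2);
        memcpy(&r.b[4 * i + 2], &x.b[2 * i], 2);
    }
    return r;
}

/* pshufd x, x, 0: doubleword 0 is copied into all four doublewords. */
static xmm pshufd_0(xmm x)
{
    xmm r;
    for (int i = 0; i < 4; i++)
        memcpy(&r.b[4 * i], &x.b[0], 4);
    return r;
}

/* The full sequence from the listing: 1 copy -> 2 -> 4 -> 16 copies. */
static xmm broadcast_byte(uint8_t v)
{
    return pshufd_0(punpcklwd_self(punpcklbw_self(movd(v))));
}
```

Each unpack doubles the number of copies of the wanted byte at the bottom of the register, and the final shuffle replicates the resulting doubleword across all four positions.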


I have decided to post the patch regardless, so that others may attempt to determine whether this holds true on later (i.e. newer than the AMD 2393SE I tested on) irons.


You will need:

Add the above vpatch and seal to your V-set, and press to simd_tlb_lookup.kv.vpatch simd_tlb_errata.kv.vpatch.

Build and test as described in the previous article. Dhrystone is included in the demo booter.




Edit: removed the nonfunctional TLB cache. Performance is now approximately on par with the non-SIMD version. SSE is still a scam.


Edit: I neglected to document this earlier: the poweroff command in the Busybox shell will cleanly exit the emulator.


~Probably to be not continued!~

Currently I suspect that this line of research is a dead end!

Though at some point I will post the kernel patches, so that someone else can continue smashing his head against this wall, if he so wishes.

This entry was written by Stanislav, posted on Sunday July 28 2019, filed under Bitcoin, Cold Air, Computation, Cryptography, Distractions, Friends, MIPS, SoftwareArchaeology, SoftwareSucks.
