
Is volatile expensive?




After reading the JSR-133 Cookbook for Compiler Writers about the implementation of volatile, especially the section "Interactions with Atomic Instructions", I assume that reading a volatile variable without updating it needs a LoadLoad or a LoadStore barrier. Further down the page I see that LoadLoad and LoadStore are effectively no-ops on x86 CPUs. Does this mean that volatile read operations can be done without explicit cache invalidation on x86, and are as fast as a normal variable read (disregarding the reordering constraints of volatile)?

I don't believe I understand this correctly. Would anyone care to enlighten me?

EDIT: I wonder whether there are differences in multi-processor environments. On a single-CPU system the CPU might just consult its own caches, as John V. says, but on a multi-CPU system there must be some configuration that tells the CPUs this is not enough and that main memory has to be hit, making volatile slower on multi-CPU systems. Right?

PS: On my way to learning more about this I stumbled upon the following great articles, and since this question may be interesting to others, I'll share my links here:


An uncontended volatile read on Intel is quite cheap. Consider the following simple case:

public class Test2 {
    public static long l;

    public static void run() {
        if (l == -1)
            System.exit(-1);

        if (l == -2)
            System.exit(-1);
    }
}

Using Java 7's ability to print assembly code (-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, which requires the hsdis disassembler plugin), the run method looks like this:

# {method} 'run2' '()V' in 'Test2'
#           [sp+0x10]  (sp of caller)
0xb396ce80: mov    %eax,-0x3000(%esp)
0xb396ce87: push   %ebp
0xb396ce88: sub    $0x8,%esp          ;*synchronization entry
                                    ; - Test2::run2@-1 (line 33)
0xb396ce8e: mov    $0xffffffff,%ecx
0xb396ce93: mov    $0xffffffff,%ebx
0xb396ce98: mov    $0x6fa2b2f0,%esi   ;   {oop('Test2')}
0xb396ce9d: mov    0x150(%esi),%ebp
0xb396cea3: mov    0x154(%esi),%edi   ;*getstatic l
                                    ; - Test2::run@0 (line 33)
0xb396cea9: cmp    %ecx,%ebp
0xb396ceab: jne    0xb396ceaf
0xb396cead: cmp    %ebx,%edi
0xb396ceaf: je     0xb396cece         ;*getstatic l
                                    ; - Test2::run@14 (line 37)
0xb396ceb1: mov    $0xfffffffe,%ecx
0xb396ceb6: mov    $0xffffffff,%ebx
0xb396cebb: cmp    %ecx,%ebp
0xb396cebd: jne    0xb396cec1
0xb396cebf: cmp    %ebx,%edi
0xb396cec1: je     0xb396ceeb         ;*return
                                    ; - Test2::run@28 (line 40)
0xb396cec3: add    $0x8,%esp
0xb396cec6: pop    %ebp
0xb396cec7: test   %eax,0xb7732000    ;   {poll_return}
;... lines removed

Looking at the two references to getstatic, the first involves a load from memory, while the second skips the load because the value is reused from the register(s) it was already loaded into (the long is 64-bit, so on my 32-bit laptop it occupies two registers).

If we make the variable l volatile, the resulting assembly is different:

# {method} 'run2' '()V' in 'Test2'
#           [sp+0x10]  (sp of caller)
0xb3ab9340: mov    %eax,-0x3000(%esp)
0xb3ab9347: push   %ebp
0xb3ab9348: sub    $0x8,%esp          ;*synchronization entry
                                    ; - Test2::run2@-1 (line 32)
0xb3ab934e: mov    $0xffffffff,%ecx
0xb3ab9353: mov    $0xffffffff,%ebx
0xb3ab9358: mov    $0x150,%ebp
0xb3ab935d: movsd  0x6fb7b2f0(%ebp),%xmm0  ;   {oop('Test2')}
0xb3ab9365: movd   %xmm0,%eax
0xb3ab9369: psrlq  $0x20,%xmm0
0xb3ab936e: movd   %xmm0,%edx         ;*getstatic l
                                    ; - Test2::run@0 (line 32)
0xb3ab9372: cmp    %ecx,%eax
0xb3ab9374: jne    0xb3ab9378
0xb3ab9376: cmp    %ebx,%edx
0xb3ab9378: je     0xb3ab93ac
0xb3ab937a: mov    $0xfffffffe,%ecx
0xb3ab937f: mov    $0xffffffff,%ebx
0xb3ab9384: movsd  0x6fb7b2f0(%ebp),%xmm0  ;   {oop('Test2')}
0xb3ab938c: movd   %xmm0,%ebp
0xb3ab9390: psrlq  $0x20,%xmm0
0xb3ab9395: movd   %xmm0,%edi         ;*getstatic l
                                    ; - Test2::run@14 (line 36)
0xb3ab9399: cmp    %ecx,%ebp
0xb3ab939b: jne    0xb3ab939f
0xb3ab939d: cmp    %ebx,%edi
0xb3ab939f: je     0xb3ab93ba         ;*return
;... lines removed

In this case both of the getstatic references to the variable l involve a load from memory, i.e. the value cannot be kept in a register across multiple volatile reads. To ensure an atomic read, the value is loaded from memory straight into an SSE register (movsd 0x6fb7b2f0(%ebp),%xmm0), making the read a single instruction (the previous example showed that a 64-bit value would otherwise require two 32-bit reads on a 32-bit system).
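To see why that single-instruction read matters, here is a minimal sketch (class and field names are hypothetical, not from the original answer) of the word tearing that JLS §17.7 permits for non-volatile long and double fields; in practice it is only observable on a 32-bit VM:

public class TearingDemo {
    static long shared; // declare this 'volatile' and the check below must never fire

    public static void main(String[] args) {
        new Thread(() -> {
            // Writer alternates between two 64-bit patterns. Note the JIT is
            // free to coalesce or reorder these plain stores -- one more
            // symptom of the missing volatile.
            while (true) {
                shared = 0L;
                shared = -1L;          // 0xFFFFFFFF_FFFFFFFF
            }
        }).start();

        // Reader looks for a half-written value (one 32-bit half from each
        // store). Without volatile the JIT may also hoist this read out of
        // the loop, which is yet another visibility hazard.
        while (true) {
            long snapshot = shared;
            if (snapshot != 0L && snapshot != -1L) {
                System.out.printf("torn read: 0x%016x%n", snapshot);
                System.exit(1);
            }
        }
    }
}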

So the overall cost of a volatile read is roughly equivalent to a memory load and can be as cheap as an L1 cache access. However, if another core is writing to the volatile variable, the cache line will be invalidated, requiring a main-memory or perhaps an L3 cache access. The actual cost depends heavily on the CPU architecture; even between Intel and AMD the cache-coherency protocols differ.


Generally speaking, on most modern processors a volatile load is comparable to a normal load, while a volatile store takes about one third the time of a monitor-enter/monitor-exit pair. This holds on systems that are cache-coherent.

To answer the OP's question, volatile writes are expensive while the reads usually are not.
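A way to check these cost claims on your own hardware is a microbenchmark. The sketch below assumes JMH (org.openjdk.jmh) is on the classpath and is run through JMH's generated harness rather than invoked directly; the class and method names are illustrative:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class VolatileCostBench {
    long plain = 42;
    volatile long vol = 42;

    @Benchmark
    public long plainRead() {
        return plain;            // the JIT may keep this in a register
    }

    @Benchmark
    public long volatileRead() {
        return vol;              // must be re-read on every invocation
    }

    @Benchmark
    public void volatileWrite() {
        vol = 1;                 // on x86 HotSpot adds a StoreLoad fence after the store
    }

    @Benchmark
    public void monitorWrite() {
        synchronized (this) {    // monitor-enter/exit pair for comparison
            plain = 1;
        }
    }
}

Expect the gap between plainRead and volatileRead to widen sharply if another thread writes vol concurrently, since each read then costs a cache-line transfer.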

Does this mean that volatile read operations can be done without explicit cache invalidation on x86, and are as fast as a normal variable read (disregarding the reordering constraints of volatile)?

Yes, sometimes when reading a field the CPU may not even hit main memory; instead it can snoop other cores' caches and get the value from there (a very general explanation).

However, I second Neil's suggestion that if you have a field accessed by multiple threads you should wrap it in an AtomicReference. Being an AtomicReference, it gives roughly the same throughput for reads and writes, but it also makes it more obvious that the field will be accessed and modified by multiple threads.
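As a sketch of that suggestion (the Config value type and class names here are hypothetical), wrapping the shared field in an AtomicReference makes the concurrent access explicit and adds compare-and-set on top of the volatile semantics:

import java.util.concurrent.atomic.AtomicReference;

public class SharedConfig {
    // Hypothetical immutable value object standing in for "the field".
    static final class Config {
        static final Config DEFAULT = new Config(1000);
        final int timeoutMs;
        Config(int timeoutMs) { this.timeoutMs = timeoutMs; }
    }

    private final AtomicReference<Config> current =
            new AtomicReference<>(Config.DEFAULT);

    public Config get() {              // volatile-read semantics
        return current.get();
    }

    public void set(Config next) {     // volatile-write semantics
        current.set(next);
    }

    public boolean trySwap(Config expected, Config next) {
        // The extra capability over a bare volatile field: atomic CAS.
        return current.compareAndSet(expected, next);
    }
}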

Edit to answer OP's edit:

Cache coherence is a somewhat complicated protocol, but in short: CPUs share cache lines that are backed by main memory. If a CPU loads a memory location and no other CPU has it, that CPU assumes it holds the most up-to-date value. If another CPU tries to load the same location, the CPU that already holds it will notice and share the cached line with the requesting CPU; the requesting CPU then has a copy of that memory in its own cache. (It never had to go to main memory for it.)

There is quite a bit more to the protocol, but this gives an idea of what is going on. Also, to answer your other question: in the absence of multiple processors, volatile reads/writes can in fact be faster than with multiple processors. Some applications would in fact run faster concurrently on a single CPU than on several.


In the words of the Java Memory Model (as defined for Java 5+ in JSR 133), operations on a volatile variable create happens-before relationships: in particular, a write to a volatile variable happens-before every subsequent read of that same variable. This means the compiler and JIT are forced to avoid certain optimisations, such as reordering instructions within the thread or keeping the value only in a register or local cache.
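The classic illustration of that guarantee is safe publication through a volatile flag; a minimal sketch (hypothetical names):

public class Publication {
    static int payload;                  // deliberately NOT volatile
    static volatile boolean ready;

    static void writer() {
        payload = 42;                    // ordinary write...
        ready = true;                    // ...published by the volatile write;
    }                                    // the two stores cannot be reordered

    static void reader() {
        if (ready) {
            // The volatile read of 'ready' synchronizes with the volatile
            // write, so the reader is guaranteed to observe payload == 42.
            System.out.println(payload);
        }
    }
}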

Since some optimisations are unavailable, the resulting code is necessarily slower than it would otherwise have been, though probably not by very much.

Nevertheless you shouldn't make a variable volatile unless you know that it will be accessed from multiple threads outside of synchronized blocks. Even then you should consider whether volatile is the best choice versus synchronized, AtomicReference and its friends, the explicit Lock classes, etc.


Accessing a volatile variable is in many ways similar to wrapping access to an ordinary variable in a synchronized block. For instance, access to a volatile variable prevents the CPU from re-ordering the instructions before and after the access, and this generally slows down execution (though I can't say by how much).

More generally, on a multi-processor system I don't see how access to a volatile variable can be done without penalty -- there must be some way to ensure a write on processor A will be synchronized to a read on processor B.
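To make the analogy in the last two answers concrete, here is a minimal sketch (hypothetical class) showing that a volatile field and synchronized accessors around a plain field provide the same visibility guarantee, and where the analogy stops:

public class Holder {
    private volatile int a;          // option 1: volatile field

    private int b;                   // option 2: same visibility via locked accessors
    public synchronized int getB() { return b; }
    public synchronized void setB(int v) { b = v; }

    public int getA() { return a; }
    public void setA(int v) { a = v; }

    // Limit of the analogy: neither 'a++' nor setB(getB() + 1) is atomic as
    // a whole; a read-modify-write still needs one synchronized block
    // covering both steps, or an AtomicInteger.
}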

Source: https://stackoverflow.com/questions/4633866/is-volatile-expensive
