Difference between TSX-NI and Hyper-Threading?

Ragnarous

Distinguished
Sep 15, 2013
Hello, I've seen an i5 processor with TSX-NI technology that didn't have Hyper-Threading Technology, and an i7 that had Hyper-Threading but not TSX-NI. Why doesn't the i7 have both, and what exactly is the difference between those technologies?
Thank you!
 
To my knowledge TSX only has an impact in certain business applications, whereas Hyperthreading can be used far more readily.

Here is a long post from Pinhedd from 2013; to be honest, I haven't read it all yet:
TSX-NI allows the hardware to handle memory-synchronization operations that previously had to be performed by hand in software.

One of the greatest challenges faced by programmers is concurrency. Contrary to what most people on the web would have you believe, concurrency is extremely difficult.

On the x86 platform (most other platforms are fairly similar, so this knowledge is rather universal) each logical processor works on one thread at a time. Microprocessors that expose two threads per physical core (Hyperthreading, AKA SMT) are no different. From now on I will refer only to logical processors.

When one logical processor works on a thread, there's no inherent synchronization with the workload executed by another logical processor. If the two workloads have completely disjoint memory spaces (they don't share any memory) then this poses no problem. This is typical of a microprocessor with multiple logical processors running multiple programs that are single threaded. Each logical processor can access the memory of the thread that it is working on without interference from the thread that the other logical processor is working on.

When programs are multi-threaded, there are a number of approaches that can be taken based on the workload that is expected. The easiest way is to simply duplicate the entire address space and have each thread work on its own chunk, again without interference. This is commonly done by web servers handling multiple HTTP requests. However, this approach isn't very effective for applications that require the work performed by multiple threads to be stitched back together, such as executing a game.
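
As an aside, the classic way to get that per-worker duplication on a POSIX system is fork(), which gives each worker a private copy of the parent's address space. A rough sketch (the worker count and the printed message are just for illustration):

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Each child gets a private copy of the parent's address space,
    // so the workers cannot corrupt each other's memory.
    for (int i = 0; i < 4; ++i) {
        if (fork() == 0) {
            std::printf("worker %d handling its own request\n", i);
            _exit(0);              // child exits; it never touches shared state
        }
    }
    while (wait(nullptr) > 0) {}   // parent reaps the workers
}
```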

For these more complicated concurrency tasks it makes sense to have multiple threads share some sections of memory. As mentioned above, there's no inherent synchronization performed between logical processors. If one logical processor experiences an interrupt or a block, its execution may fall behind that of the other processor. This leads to what are called "race conditions" in which the unsynchronized execution between two or more threads leads to different non-deterministic outcomes.

For example, let's take a simple two-threaded program. Thread A prints the letter 'A' to the console, and thread B prints the letter 'B' to the console. Both threads are started simultaneously. If the threads were synchronized, the output would look something like this:

ABABABABAB

However, it will actually look something like this:

ABBAABABAA

This occurs because A and B are both racing to make the system call which prints their respective letters to the console. The first thread to make the system call gets to 'block' the other. The operating system handles the console output serially. The second thread is prevented from furthering its execution until the system call from the first thread completes. System calls are assisted by hardware and handled by the operating system so even if there were 26 threads from A to Z all trying to print to the console, there's no way that they can directly interfere with each other's operation.
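
To make the race visible, here's a minimal C++ sketch of that scenario (the loop count is arbitrary):

```cpp
#include <iostream>
#include <thread>

// Each thread races to print its letter. The OS serializes the underlying
// console writes, but the interleaving is non-deterministic.
void print_letter(char c) {
    for (int i = 0; i < 5; ++i)
        std::cout << c;            // each insertion is serialized; the order is not
}

int main() {
    std::thread a(print_letter, 'A');
    std::thread b(print_letter, 'B');
    a.join();
    b.join();
    std::cout << '\n';             // prints some interleaving like ABBAABABAA
}
```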

Let's remove the system calls and the operating system from the equation and have threads A and B interact directly, without any hardware assistance. Imagine now that there's a 10-byte shared memory space between them, initialized with the values 1 through 10. Thread A loads the value from each memory location and adds 5 to it. Thread B loads the value from each memory location and doubles it.

If each thread ran to completion on its own, the results for thread A would be 6, 7, 8 ... 15, and for thread B, 2, 4, 6 ... 20.

Threads A and B are started at the exact same time.

Thread A wins the race condition and manages to load and write the first value, so the first address goes from 1 -> 6.

Thread A is then interrupted and has to pause execution. Thread B, running on another logical processor, continues running.

Thread B loads the first value and doubles it. However, thread A had already gotten there. So instead of 1 -> 2, we get 6 -> 12. Thread B continues execution and loads the second value, doubling it from 2 -> 4. Thread B loads the third value, 3, but is interrupted before it can double it and write it back to the shared memory.

Thread A resumes execution and loads the second value. We presumed above that it would be 2, but B wrote it as 4. So 4 -> 9. Thread A then loads the third value, 3, adds 5, and writes 8.

Thread B resumes execution where it left off, writing 6 to the third address. This overwrites the 8 just written by A.

In the example above, A and B have no mutual respect for each other. They end up overwriting data, ultimately corrupting it.
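
Here's what that unsynchronized read-modify-write looks like in code. This is a deliberate data race (formally undefined behavior in C++), so the final contents vary from run to run, just like the trace above (the array and function names are my own illustration):

```cpp
#include <cstdio>
#include <thread>

int shared_data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

// Deliberate data race: each thread loads a value, computes, and stores it
// back with no synchronization, so updates can be lost or double-applied.
void add_five() {
    for (int i = 0; i < 10; ++i)
        shared_data[i] = shared_data[i] + 5;   // load, add, store -- not atomic
}

void double_it() {
    for (int i = 0; i < 10; ++i)
        shared_data[i] = shared_data[i] * 2;   // races with add_five()
}

int main() {
    std::thread a(add_five);
    std::thread b(double_it);
    a.join();
    b.join();
    for (int v : shared_data) std::printf("%d ", v);  // varies run to run
    std::printf("\n");
}
```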

In order to deal with this, programmers use a method called "locking" to prevent shared data from being corrupted. The first action taken on a range of shared memory is to lock it by acquiring a mutex (a mutual exclusion lock). The final operation is to unlock it. This kind of locking is implemented in software rather than handled transparently by hardware, so each thread must check for the presence of a lock before it actually tries to lock the memory itself.

So for any arbitrarily sized chunk of shared memory, the process looks like so:

Check for lock

If locked, wait for unlock (blocked)

If not locked, lock it

Do stuff

Unlock

For very small chunks of memory, that's quite a bit of overhead. The overhead matters less for very large chunks, but locking very large chunks of memory can be very inefficient in highly threaded applications, since every thread that touches the region must wait its turn.

Furthermore, locking is only designed to prevent other threads from corrupting shared memory through unsynchronized writes. It does not prevent other threads from reading the locked data (and it shouldn't).
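
In C++, that whole check/wait/lock/unlock dance collapses into a mutex. A sketch of the same two workers from before, now corruption-free, using one coarse lock over the whole region:

```cpp
#include <mutex>
#include <thread>

int shared_data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
std::mutex region_lock;   // guards the entire 10-element region

// "Check for lock / wait if locked / lock / do stuff / unlock" is exactly
// what lock_guard does: its constructor blocks until the mutex is acquired,
// and its destructor releases it.
void add_five() {
    std::lock_guard<std::mutex> guard(region_lock);
    for (int i = 0; i < 10; ++i)
        shared_data[i] += 5;   // no other locking thread can interleave here
}

void double_it() {
    std::lock_guard<std::mutex> guard(region_lock);
    for (int i = 0; i < 10; ++i)
        shared_data[i] *= 2;
}

int main() {
    std::thread a(add_five), b(double_it);
    a.join();
    b.join();
    // The final values depend on which thread locked first,
    // but the data is never corrupted.
}
```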

Introducing the Transaction. Transactions are an interesting CS concept. In summary, a transaction is any set of operations taken on a set of data that can be considered to be isolated (the changes to the data come only from the transaction) and complete (not partial). Transactions have long been a staple of databases, but the same logic can be applied to shared memory.

TSX-NI allows for threads to declare a section of memory as being transactional in nature. A thread can begin a transaction on that section of memory after which it will not change regardless of what any other thread does (isolated). It can be written to arbitrarily, but those writes will not be visible to threads running on other logical processors until that transaction is completed. If the transaction is aborted, any changes to that memory are rolled back.

Going back to the example above, A can begin a transaction on the shared memory space and write to it. When it completes the transaction, all of its changes become visible simultaneously. B can then start a transaction and do its own work.

This relieves the programmer from having to implement the complicated software locking method mentioned above. It also allows threads to start transactions on the same region provided that their operations don't conflict with each other: A could work on the first 5 bytes while B works on the last 5.
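
For the curious, the RTM side of TSX is exposed through compiler intrinsics. A rough sketch, assuming a TSX-capable CPU and GCC/Clang built with -mrtm; a production-quality version would also have the transactional path test the fallback lock before committing, which I've omitted for brevity:

```cpp
#include <immintrin.h>   // RTM intrinsics (_xbegin/_xend); build with -mrtm
#include <cstdio>
#include <mutex>

int shared_data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
std::mutex fallback;     // a transaction can always abort, so a software
                         // fallback path is mandatory

void add_five() {
    unsigned status = _xbegin();          // try to start a hardware transaction
    if (status == _XBEGIN_STARTED) {
        for (int i = 0; i < 10; ++i)
            shared_data[i] += 5;          // buffered writes: invisible to other
                                          // cores until the transaction commits
        _xend();                          // commit: all writes appear at once
    } else {
        // Aborted (conflict, capacity, interrupt...): redo the work
        // under an ordinary lock instead.
        std::lock_guard<std::mutex> guard(fallback);
        for (int i = 0; i < 10; ++i)
            shared_data[i] += 5;
    }
}

int main() {
    add_five();
    for (int v : shared_data) std::printf("%d ", v);
    std::printf("\n");
}
```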

Ultimately, TSX-NI is a very interesting tool that will have some very profound effects on highly concurrent data management systems, just as AES-NI had very profound effects on encryption systems.

I hope that this answered your question.
 

Hah!

Now find one of my many posts on SMT AKA Hyperthreading and I will have answered the OP's question without having provided any direct input whatsoever!
 
*cough*
What you describe is closer to coarse-grained multithreading, not simultaneous multithreading. I'll cover the big three concepts behind hardware threading:

CGMT, or Coarse-Grained MultiThreading

FGMT, or Fine-Grained MultiThreading

SMT, or Simultaneous MultiThreading

From the perspective of user-mode software (I won't discuss kernel stuff, as that can get a bit tricky) all three appear the same: as two or more logical processors. The relationship between a logical processor and its physical hardware can usually be divined by looking at various identifiers such as the APIC ID, and very high-quality software will usually calibrate itself based on the arrangement of sockets, cores, and logical processors. However, there is no need to do this unless one wants to optimize performance.
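
For example, the simplest such query just counts logical processors, with no topology detail at all:

```cpp
#include <iostream>
#include <thread>

int main() {
    // Number of logical processors visible to the OS. On a Hyperthreaded
    // quad-core this reports 8, with no distinction between physical cores
    // and SMT siblings (finer detail requires CPUID/APIC IDs).
    unsigned n = std::thread::hardware_concurrency();
    std::cout << n << " logical processors\n";
}
```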

Under the CGMT scheme the microprocessor tracks two or more thread contexts (one on each logical processor) and works on each thread in chunks. The microprocessor switches logical processors whenever a logical processor stalls (such as on a cache miss), blocks (such as on a lengthy IO operation), or exhausts an allotted number of cycles, which prevents starvation.

Under a non-multithreaded scheme, a cache miss or a block can be resolved in two ways: either the microprocessor inserts stall cycles until the thread can continue, or the operating system performs a context switch, which replaces the running thread on the logical processor with a new one. In most cases a cache miss will be resolved faster than the operating system can load a new context onto the logical processor (which is itself highly likely to result in a cache miss), so a cache miss or an uncompensated hazard in a non-threaded environment almost always results in stall cycles. However, a CGMT microprocessor can switch logical processors very, very quickly, and do so in a fashion that is completely transparent to the operating system. Thus, instead of stalling, the microprocessor switches to another context that is already loaded on another logical processor.

Under CGMT, whenever a certain stall threshold is met, the microprocessor switches execution to another logical processor so that it has something to do. When that logical processor stalls or exceeds its cycle allocation it switches again, either back to the first thread or to the next one. Even though the microprocessor is executing instructions from a separate logical processor, the memory manager will continue to resolve the miss for the processor that stalled.

FGMT takes this concept a little bit further and alternates execution between logical processors on every cycle. Ideally a stall would be resolved by the time that the microprocessor returns to the thread that stalled, but if it is not, it can be skipped. This is advantageous when execution resources are much cheaper and more plentiful than fast memory. Processors of this style are known as barrel processors because they rotate execution over the logical processors in a cyclic fashion. This style of execution was very popular in the 1990s and early 2000s.

SMT is the apex of multithreading, and is exclusive to superscalar microarchitectures (although FGMT/CGMT can also operate on superscalar architectures). The microprocessor issues instructions to the execution pipes (a feature of superscalar microarchitectures) from all logical processors in a dynamic fashion. If one logical processor stalls, blocks, or is idled for performance reasons, the microprocessor will dedicate all resources to the remaining logical processors. As the number of logical processors per core grows, it becomes easier and easier to keep the execution pipes on that core busy 100% of the time. This is very desirable in throughput-sensitive environments such as application servers and databases, but can be detrimental to real-time-sensitive environments such as gaming. This is why many game developers configure thread affinity on Hyperthreaded microprocessors, keeping critical threads from fighting each other for a core's resources.
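
On Linux, pinning a thread to one logical processor looks roughly like this (pthread_setaffinity_np is a GNU extension, and the sibling numbering in the comment is only typical, not guaranteed):

```cpp
#include <pthread.h>   // Linux-specific GNU extension
#include <sched.h>
#include <thread>

// Pin a thread to one logical processor so the scheduler never migrates it
// onto an SMT sibling that is busy with other work.
void pin_to_cpu(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    std::thread worker([] { /* latency-sensitive work here */ });
    // On many Hyperthreaded parts, CPUs 0 and 1 are siblings of the same
    // physical core; check the actual topology before choosing a CPU.
    pin_to_cpu(worker, 0);
    worker.join();
}
```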

I hope that this was informative.