TSX-NI allows the hardware to handle the synchronization of memory operations that previously had to be performed by hand in software.
One of the greatest challenges faced by programmers is concurrency. Contrary to what most people on the web would have you believe, concurrency is extremely difficult.
On the x86 platform (most other platforms are fairly similar, so this knowledge is rather universal), each logical processor works on one thread at a time. Microprocessors that expose two threads per physical core (Hyper-Threading, AKA SMT) are no different. From now on I will refer only to logical processors.
When one logical processor works on a thread, there's no inherent synchronization with the workload executed by another logical processor. If the two workloads have completely disjoint memory spaces (they don't share any memory) then this poses no problem. This is typical of a microprocessor with multiple logical processors running multiple programs that are single threaded. Each logical processor can access the memory of the thread that it is working on without interference from the thread that the other logical processor is working on.
When programs are multi-threaded, there are a number of approaches that can be taken based on the workload that is expected. The easiest way is to simply duplicate the entire address space and have each thread work on its own chunk, again without interference. This is commonly done by webservers handling multiple HTTP requests. However, this approach isn't very effective for applications that require the work performed by multiple threads to be stitched back together, such as executing a game.
For these more complicated concurrency tasks it makes sense to have multiple threads share some sections of memory. As mentioned above, there's no inherent synchronization performed between logical processors. If one logical processor experiences an interrupt or a block, its execution may fall behind that of the other processor. This leads to what are called "race conditions" in which the unsynchronized execution between two or more threads leads to different non-deterministic outcomes.
For example, let's take a simple two-threaded program. Thread A prints the letter 'A' to the console, and thread B prints the letter 'B' to the console. Both threads are started simultaneously. If the threads were synchronized, the output would look something like this:
ABABABABAB
However, it will actually look something like this:
ABBAABABAA
This occurs because A and B are both racing to make the system call which prints their respective letters to the console. The first thread to make the system call gets to 'block' the other. The operating system handles the console output serially. The second thread is prevented from furthering its execution until the system call from the first thread completes. System calls are assisted by hardware and handled by the operating system so even if there were 26 threads from A to Z all trying to print to the console, there's no way that they can directly interfere with each other's operation.
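As an illustration, here's a minimal sketch of that two-thread program in C using POSIX threads (the loop count and structure are my own; any similar program shows the effect):

```c
#include <pthread.h>
#include <unistd.h>

/* Each thread makes five write() system calls, one per character.
 * The kernel serializes the individual calls, so no character is
 * ever torn, but the scheduler decides how the two loops
 * interleave: ABBAABABAA one run, AABBABABBA the next. */
static void *print_letter(void *arg)
{
    char letter = *(char *)arg;
    for (int i = 0; i < 5; i++)
        write(STDOUT_FILENO, &letter, 1);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    char la = 'A', lb = 'B';

    pthread_create(&a, NULL, print_letter, &la);
    pthread_create(&b, NULL, print_letter, &lb);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    write(STDOUT_FILENO, "\n", 1);
    return 0;
}
```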
Let's remove the system calls and the operating system from the equation and have threads A and B interact directly without any hardware assistance. Imagine now that there's a 10-byte memory space shared between them, initialized with the values 1 through 10. Thread A loads the value from each memory location and adds 5 to it. Thread B loads the value from each memory location and doubles it.
For thread A, the results would be 6,7,8...15. For thread B, the results would be 2,4,6...20.
Threads A and B are started at the exact same time.
Thread A wins the race and manages to load and write the first value, so the first address goes from 1 -> 6.
Thread A is then interrupted and has to pause execution. Thread B, running on another logical processor, continues running.
Thread B loads the first value and doubles it. However, thread A had already gotten there. So instead of 1 -> 2, we get 6 -> 12. Thread B continues execution and loads the second value, doubling it from 2 -> 4. Thread B loads the third value, 3, but is interrupted before it can double it and write it back to the shared memory.
Thread A resumes execution and loads the second value. We presumed above that it would be 2, but B wrote it as 4. So 4 -> 9. Thread A then loads the third value, 3, adds 5, and writes 8.
Thread B resumes execution where it left off, writing 6 to the third address. This overwrites the 8 just written by A.
In the example above, A and B have no mutual respect for each other. They end up overwriting data, ultimately corrupting it.
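Here's a sketch of that scenario in C. The array and the two operations come straight from the example; without any synchronization, the final contents depend entirely on how the threads happen to interleave. Run it a few times and the printed array will change, which is exactly the corruption described above.

```c
#include <pthread.h>
#include <stdio.h>

/* Ten shared values, initialized 1 through 10. */
static int shared[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

/* Thread A: load each value, add 5, store it back. */
static void *add_five(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10; i++) {
        int v = shared[i];   /* load */
        shared[i] = v + 5;   /* store: the other thread may have
                                written in between */
    }
    return NULL;
}

/* Thread B: load each value, double it, store it back. */
static void *double_it(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10; i++) {
        int v = shared[i];
        shared[i] = v * 2;
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, add_five, NULL);
    pthread_create(&b, NULL, double_it, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* The final contents depend on how the loads and stores
     * interleaved; different runs can print different arrays. */
    for (int i = 0; i < 10; i++)
        printf("%d ", shared[i]);
    printf("\n");
    return 0;
}
```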
In order to deal with this, programmers use a method called "locking" to prevent shared data from being corrupted. The first action taken on a range of shared memory is to lock it by acquiring a mutex (short for "mutual exclusion"). The final operation is to unlock it. The lock is a software convention rather than something the hardware enforces on the memory itself, so each thread must check for the presence of a lock before it actually touches the shared memory.
So for any arbitrarily sized chunk of shared memory, the process looks like so:
Check for lock
If locked, wait for unlock (blocked)
If not locked, lock it
Do stuff
Unlock
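In C with POSIX threads, that protocol maps onto a mutex. This is a minimal sketch; pthread_mutex_lock performs the check-wait-lock steps atomically on your behalf:

```c
#include <pthread.h>

static int shared[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

/* One mutex guards the entire shared region. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_five(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);      /* check for lock, wait, lock it */
    for (int i = 0; i < 10; i++)    /* do stuff */
        shared[i] += 5;
    pthread_mutex_unlock(&lock);    /* unlock */
    return NULL;
}
```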
For very small chunks of memory, that's quite a bit of overhead relative to the work being done. The overhead matters less for very large chunks, but locking very large chunks of memory serializes access to them, which can be very inefficient in highly threaded applications.
Furthermore, locking is only designed to prevent other threads from corrupting shared memory through unsynchronized writes. It does not prevent other threads from reading the locked data (and it shouldn't).
Introducing the Transaction. Transactions are an interesting CS concept. In summary, a transaction is any set of operations taken on a set of data that can be considered isolated (the changes to the data come only from the transaction) and complete (never partial). Transactions have long been a staple of databases, but the same logic can be applied to shared memory.
TSX-NI allows threads to declare a section of memory as being transactional in nature. A thread can begin a transaction on that section, after which it will not change regardless of what any other thread does (isolation). The thread can write to it arbitrarily, but those writes will not be visible to threads running on other logical processors until the transaction completes. If the transaction is aborted, any changes to that memory are rolled back.
Going back to the example above, A can begin a transaction on the shared memory space and write to it. When it commits the transaction, all of its changes become visible simultaneously. B can then start a transaction and do its own work.
This relieves the programmer from having to implement the complicated software locking scheme described above. It also allows threads to run transactions on the same region simultaneously, provided that their operations don't conflict with each other: A could work on the first 5 bytes while B works on the last 5.
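For the curious, here is roughly what this looks like with the RTM intrinsics that TSX provides (_xbegin and _xend in <immintrin.h>, compiled with -mrtm on GCC/Clang). This is a sketch, not production code: transactions can abort for many reasons, so real code always needs a fallback path, here an ordinary mutex.

```c
#include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED (needs -mrtm) */
#include <pthread.h>

static int shared[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;

static void add_five_transactional(void)
{
    unsigned status = _xbegin();        /* begin the transaction */
    if (status == _XBEGIN_STARTED) {
        /* These writes are buffered by the hardware and become
         * visible to other logical processors all at once when
         * _xend() commits. If another thread touches the same
         * cache lines first, the whole transaction aborts and
         * every buffered write is rolled back. */
        for (int i = 0; i < 10; i++)
            shared[i] += 5;
        _xend();                        /* commit: all or nothing */
    } else {
        /* The transaction aborted; fall back to an ordinary lock. */
        pthread_mutex_lock(&fallback);
        for (int i = 0; i < 10; i++)
            shared[i] += 5;
        pthread_mutex_unlock(&fallback);
    }
}
```

A production-quality version would also check the fallback lock inside the transaction and abort if another thread holds it, so the two paths can never interleave; that detail is omitted here to keep the sketch short.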
Ultimately, TSX-NI is a very interesting tool that will have profound effects on highly concurrent data management systems, much as AES-NI had on encryption systems.
I hope that this answered your question.