Samsung Crams 24-SSD RAID Experiment

ossie, you've lost me. It was a 64-CPU system! Bandwidth scales
with the number of CPUs. Be skeptical if you like, but that's how
it worked. We're talking an 8-rack machine here, not a desktop. 😀
It was done with a large number of FC connections.

The system bandwidth was even higher, 80GB/sec. The later O3K
system is capable of being much faster again, and ditto Altix.
Comparing to some modern chip's L1 cache is a tad irrelevant IMO.
See the original document on my site. After 9/11, SGI stopped publishing
details on what its high-end systems are capable of for defense
imaging - they never published an equivalent paper for Origin3K,
which in theory could be 10X faster quite easily. The max sustained
bandwidth of a 512-CPU Origin3K is 716GB/sec. They didn't publish
numbers for the 1024 or 2048-CPU configs (should just scale
accordingly).


Re the LSI, what I meant was, when I set up a hw RAID it will
only let me do this using disks from a single channel, whereas
it would be faster to use both channels, alternating accesses
back & forth across the controllers. I don't get why the SCSI
BIOS doesn't allow this. The card is (now) a 22320-R PCIX.

Ian.

 
> I don't get why the SCSI BIOS doesn't allow this.


Ian,

I'm guessing here, because I don't have extensive SCSI experience.
So, please forgive me if my "guess" is wrong.

(Sometimes found errors lead us to better solutions:
one of my best teachers once said, "Failure is postponed success.")


I think SCSI was designed to "chain" lots of devices
on a single controller port and, as such, the
SCSI bus acts in a manner similar to the old PCI bus --
only one device on the "chain" can be addressed at any given moment.
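To put rough numbers on why splitting a stripe set across both channels should help (a quick Python sketch; the 320MB/s per-channel figure is the Ultra320 spec number, and the ~60MB/s sustained per-disk rate is just my assumption):

    # Sketch: aggregate throughput of a stripe set on one vs. two Ultra320
    # channels. The stripe is capped by whichever runs out first: the disks
    # themselves or the shared channel(s). Assumed rates, not measurements.
    CHANNEL_BW = 320.0   # MB/s shared per Ultra320 channel
    DISK_BW    = 60.0    # MB/s sustained per disk (assumption)

    def stripe_bw(disks, channels):
        return min(disks * DISK_BW, channels * CHANNEL_BW)

    for disks in (4, 8, 12):
        print(f"{disks} disks: 1 ch = {stripe_bw(disks, 1):.0f} MB/s, "
              f"2 ch = {stripe_bw(disks, 2):.0f} MB/s")

With 8 or more disks the single channel is already saturated, which is where interleaving across both channels would pay off.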

It would be an interesting, and relatively cheap, experiment
to outfit a PCI-E motherboard with, say, three x4 RAID controllers e.g.

http://www.newegg.com/Product/Product.aspx?Item=N82E16816115057

... to determine if the PCI-E bus is running them in parallel --
effectively allocating x12 PCI-E lanes to that storage subsystem.
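A rough sketch of that lane arithmetic (Python; 250MB/s per lane is the raw PCI-E 1.x figure per direction, protocol overhead ignored):

    # Sketch: theoretical one-direction bandwidth of three x4 PCI-E 1.x
    # controllers vs. a single x16 slot. Raw spec rates, overheads ignored.
    LANE_BW = 250.0  # MB/s per PCI-E 1.x lane, one direction

    def slot_bw(lanes, cards=1):
        return lanes * cards * LANE_BW

    print("3 x (x4) cards:", slot_bw(4, cards=3), "MB/s aggregate")  # 3000.0
    print("1 x (x16) slot:", slot_bw(16), "MB/s")                    # 4000.0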

Some of the more expensive RocketRAID controllers can also
be "teamed" e.g.:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816115056

http://www.highpoint-tech.com/USA/rr4320.htm

http://www.highpoint-tech.com/PDF/RR4320/RocketRAID4300_Datasheet.pdf

• Multiple card support within a system for large storage requirements


I would have to ask Highpoint if "multiple card support" also
means that 2 or more RocketRAID controllers operate in parallel.

We have a RocketRAID 2340 running quite well:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816115031&Tpk=N82E16816115031

Note well the two (2) discrete I/O processors on that board.

I believe our current throughput is limited by chipset lane
assignments, however, and not by the controller itself.

When I review PCI-E motherboards now, if they have multiple
x16 mechanical slots, I look further to see if some of those
are limited to x8 or x4 lanes by the chipset.

Also, we are running 8 x Western Digital 7,200 rpm SATA/3G HDDs
and there are now much faster HDDs compatible with that same
controller. Nevertheless, our 5GB database updates much faster
using the RocketRAID 2340 and 8 x HDDs.


Thanks for the insights!


MRFS
 
Ian, all right, but it's a very particular case. There's no way for a general-purpose architecture to reach such performance. Aggregate BW for a large system scales with the number of CPUs and with the IO and RAM channels, and the application SGI described (image viewing with limited processing) is easily parallelized. Interestingly, IO BW was almost double the RAM BW... clearly a very specialized architecture for a single purpose.
If I remember correctly, these were the days (2000) when the following HW was available: SCSI HDD 10k/36GB (~40MB/s peak), FC 1Gb/s (Seagate's X15 was still 1 year away).
For 80GB/s BW: 320 2xFC HBAs and ~2000 HDDs (the 72TB max figure) would be necessary...
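Roughly how those figures fall out (a Python sketch; ~125MB/s per 1Gb FC link and ~40MB/s per HDD are my assumed rates, not numbers from SGI's paper):

    # Back-of-envelope check, using assumed year-2000 hardware rates.
    target_bw  = 80_000   # MB/s, the claimed 80GB/s IO bandwidth
    fc_port_bw = 125      # MB/s, raw 1Gb/s FC link (assumption)
    hdd_bw     = 40       # MB/s, peak of a 10k/36GB SCSI disk (assumption)

    hbas = target_bw / (2 * fc_port_bw)   # dual-port HBAs needed
    hdds = target_bw / hdd_bw             # disks needed to feed it
    print(f"~{hbas:.0f} 2xFC HBAs, ~{hdds:.0f} HDDs")
    # ~320 HBAs, ~2000 HDDs; 2000 x 36GB also matches the 72TB maximum.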

The LSI HBA (MPT SCSI) is very limited in what its BIOS can do/offer.
I think creating logical drives across SCSI channels is possible on the MegaRAID 320-2x...
I have a MegaRAID 320-0x in an older server with a 22320 on-board, and, if you want, I could try it.
 
MRFS, I suspect the controller/cache design on SCSI disks does
at least allow some buffering to occur, so the next write/read
can begin while the previous one completes. This is why alternating
controllers and cards is so effective.

ossie, not sure what you mean by general purpose. The high I/O
database setups used on SGIs apply to much more than defense
imaging. The same configs are used for medical databases, GIS,
etc., and to a lesser extent, video (7.5K Panavision needs quite
a lot of bw). And note that the defense imaging application can
indeed involve a lot of processing, but that's done mostly by
the gfx hw (8 parallel IR pipes). In that sense, the hw they
built is general purpose as it's used by many different industries.
The system was all FC though, not SCSI (presumably 10K FC). I
expect they used the drive models with large cache RAMs - I
remember seeing an advert for an expensive FC drive that had a
72MB cache buffer.

Full O2K info is on my site.

O3K greatly improved the scalability while also vastly reducing
the latency penalty for remote accesses. On a max-spec O3K, the
latency for accessing the most remote CPU's RAM is only 2X more
than accessing a node's local RAM (quite different to O2K).

On paper at least, O3K should be far more effective for the
defense imaging stuff, but like I say SGI never published newer
specs for the Group Station's abilities. The sw was called the
Electronic Light Table.

(did you read the PDF btw?)

Ian.

 
Ian, by general purpose I mean a system architecture which is designed to process different tasks, with no (extreme) specialization. The case of the SGI imaging system is very particular, as it's designed for limited graphics processing that could easily be parallelized - the system was partitioned so that every part was responsible for its own tile. As long as each part handles operations strictly/mostly for its own tile, high aggregate performance is attainable.
One "simple" operation that comes to mind, which would bring the system to its knees, would be rotating the entire bitmap - all the parallelization features would be very poorly utilized.
As I see it, every CPU was responsible for IO and look-up operations on its tile, and all graphics processing was done in the IR units, fed by the CPUs, to display the desired data.
For operations on a stream of data, all the parallelization would be of no use, as the IO would be the bottleneck, even if further processing could be parallelized.
The HW blocks used are general purpose, but the way they're used and put together is not. If you used that system for a different purpose, it would very likely not be efficient (especially the IO part).

The HDDs and the protocol _are_ still SCSI, even if the physical connection is FC (copper/FO). HBAs and devices use base2 FC. Base10 FC is used only in inter-switch links.
At the time, only FC1 was available. http://www.fibrechannel.org/OVERVIEW/Roadmap.html

Yes, I've read it. In light of its marketing-droid language, and the fact that IO BW is almost double the RAM BW, I'm inclined to halve the HBA count figure - they almost surely put the FDX aggregate BW figure in the paper.

As I understand it, the O2K/O3K is a NUMA system, not a cluster - a single OS instance controlling the whole system. Was it abandoned with the transition to Itanium? Their lower-end PC stuff is clearly cluster-based.
 
ossie,


> Ian, by general purpose I mean a system architecture which is
> designed to process different tasks, no (extreme) specialization. The
> case of the SGI imaging system is very particular, ...

Not at all, the same system can be used for a very wide variety of
tasks. High-end 2D imaging is just one of them. Everything from
uncompressed video (Inferno, Fire), seismic & medical imaging (volume
data) to visual simulation and VR - all very different tasks, but
each benefiting in their day from how high-end SGIs and the relevant
gfx worked. eg. in the case of the GroupStation, terrain data can be
combined with image data for mission rehearsal, or fed into visual
simulation scenarios, coupled with application environments like Vega
& Multigen, displayed in stereo for VR terminals, fed to remote devices
and low-end systems with VizServer, etc.

By contrast, E&S vis sim systems (SGI's main rival years ago) were
indeed very specific to one application, ie. vis sim. Their design was
tightly coupled to the intended function of the system, eg. expansion
achieved by adding more 'eyepoint generators', vs. SGI's more general
approach.


> One "simple" operation, that comes to mind, which would get the
> system to it's knees, would be to rotate the entire bitmap - all the
> parallelization features would be very poorly utilized.

Actually no, not as bad as you think. Remember the system has Clip
Mapping (plus other functions) which can be used to access only those
parts of a larger database which are required, the system can build a
mipmap structure to enable multiple scaled versions to be worked on,
and most important of all the gfx supports the OGL ARB extensions, so
a lot of these ops are hw accelerated (helped by the system having
multiple gfx pipes working in parallel). Indeed, ELT does this even
on the desktop systems, eg. rotating a 50MB image (4K pixels wide) on
an Octane MXI works perfectly, even though MXI only has 4MB texture
RAM. Likewise, the O2 system has an incredibly low conventional
memory bandwidth compared to PCs (terrible STREAM result, only
80MB/sec at best), yet an O2 can rotate a 730MB 2D image in
real-time just fine (16000 pixels across; I tried it) - try that on a
PC and see what happens. In fact the GroupStation PDF mentions
interactive roam/pan/zoom/rotate as one of the particular functions
supported by ELT.
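To make the clip-map point concrete, here's a little Python sketch of the mipmap pyramid for a roughly 16000-pixel-square RGB image, and how much texture actually has to be resident if only a fixed clip region per level is kept (the 2048-pixel clip size is just an illustrative assumption, not the real ELT setting):

    # Sketch: full mipmap pyramid vs. clip-mapped residency for a large image.
    width = height = 16000   # pixels (assumed square for simplicity)
    bpp   = 3                # bytes per pixel, RGB
    clip  = 2048             # assumed clip-region edge per mip level

    full_bytes = clip_bytes = levels = 0
    w, h = width, height
    while w >= 1 and h >= 1:
        full_bytes += w * h * bpp
        clip_bytes += min(w, clip) * min(h, clip) * bpp
        w, h, levels = w // 2, h // 2, levels + 1

    print(f"{levels} mip levels")
    print(f"full pyramid resident : {full_bytes / 2**20:,.0f} MB")
    print(f"clip-mapped resident  : {clip_bytes / 2**20:,.0f} MB")

The resident footprint scales with the clip-region size, not the image size, which is why the gfx can roam/rotate images far bigger than its texture RAM.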


> The HW blocks used are general purpose, but the way they're used and
> put together is not. If you would use that system for a different
> purpose, with very high probability it would not be efficient
> (especially the IO part).

Not at all. Remember it's a shared memory system, so CPUs can work on
data held by remote nodes no problem; there's a latency penalty, but
it's not that bad and can be mitigated with extra inter-node Xpress
links. That's the whole point of the architecture, and why they ran
so well for many tasks. I remember seeing results for a weather
modelling application that ran massively faster on a deskside Origin
than any PC cluster of the day, purely because a CPU could access any
part of a large data set in a manner not possible with a cluster.


> The HDDs and the protocol _are_ still SCSI, even if the physical
> connection is FC (copper/FO). HBAs and devices use base2 FC. Base10
> FC is used only in inter-switch links.

How FC works does make a difference though: it scales better than
parallel SCSI, and very high bandwidths are possible even when
individual disks are slow. I've already obtained 500MB/sec with an
Octane even though each disk can only do 35MB/sec. The efficiency
of the OS helps as well.
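As a rough sketch of that kind of aggregation (Python; the 100MB/s per FC loop figure is an assumption, the 35MB/s per disk is from above):

    # Sketch: slow disks still add up, provided neither the FC loops nor the
    # host can saturate first. Per-loop rate is an assumption.
    import math

    disk_bw = 35.0    # MB/s per disk (from the post above)
    loop_bw = 100.0   # MB/s per FC loop (assumed ~1Gb/s FC)
    target  = 500.0   # MB/s aggregate

    disks_needed = math.ceil(target / disk_bw)   # 15
    loops_needed = math.ceil(target / loop_bw)   # 5
    print(f"{disks_needed} disks striped over at least {loops_needed} FC loops")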


> fact that IO BW is almost double the RAM one, I'm inclined to halve
> the HBA number figure - they almost surely put the FDX aggregate BW
> figure in the paper.

Nope. That is indeed what it can do. A guy at Lockheed confirmed this
to me (they had early access to SGI's products, as well as building
custom versions). The full duplex number would hardly be useful as PR
when an image is only being loaded (1-way traffic). If anything, the
loading speed is limited on Onyx2 more by the memory I/O, not the
RAID I/O, which is why it would run a lot faster with O3K.


> As i understand, the O2/3 is a NUMA, not a cluster, a single OS
> instance controlling the whole system. ...

Correct, though it can be partitioned if required, benefiting from the
direct links still present.


> ... With the transition to Itanium
> was it abandoned? ...

No, eg. Altix 450 and Altix 4700 systems are NUMA, using NumaLink4,
which is 4X the speed of the port connections used in Onyx2. The older
models originally launched were the Altix3xxx series.

My expectation is they'll switch to i7 XEON for their next NUMA system.


> ... Their lower PC stuff is clearly cluster.

That's the Altix XE/ICE line, yes, though they've done a lot to
provide blade designs that offer higher than normal inter-node
bandwidth and better latency, eg. multiple onboard InfiniBand and
GigE ports. Oh, my hunch was right btw, both PCIX and PCIe expansion
blades are available.

There are also the optional FPGA blades for vastly accelerated
processing of specific codes when possible.

Ian.

PS. The above is not to say SGI never made mistakes (they made lots).
But what they were capable of was very impressive and in some cases
hasn't yet been beaten by modern COTS hardware, not without some heavy
compromises and serious rewriting of code.

 
Ian, what I tried to point out was that the structure of the imaging system, with its overblown IO subsystem, could also have been used for other purposes, but its HDD IO subsystem would hardly be used as effectively in most other cases. I presume its cost was quite a high fraction of that of the whole system.

I still think the performance would've really suffered if the _whole_ bitmap at its native resolution had been rotated, and not just a subset by the IR engines. Yes, NUMA would help, but the operation isn't at all easily parallelized, and most processors would have to access the address space of others, with the implied penalties.
In this respect, NUMA is way better than clustering, even with Infiniband's RDMA - high BW, but still somewhat high latency at ~1us MPI.

FC is geared towards massive distributed point-to-point communication, but it's just the physical layer and the associated low-level protocols. The SCSI protocol is a higher-level protocol for communicating with the designated peripheral over FC, similar to iSCSI. Don't confuse it with the older parallel SCSI bus or the newer SAS 1.0. With the latest SAS 2.0 revision, routed expanders are included, which will put SAS in competition with FC...

Why would you think PR wouldn't use the FDX figure? Higher numbers are always better... in their view. The initial FC HBA estimate was based on 80GB/s SPX BW, but RAM BW was only 44GB/s. Even with higher IO BW, the excess data would have had nowhere to go. The 67GB-in-2s demo means 33.5GB/s, which would fit easily within 40GB/s SPX BW.
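The arithmetic I'm going by (a trivial Python check; all numbers are taken from the posts above):

    # Quick check of the figures being argued over.
    demo_gb, demo_s = 67, 2
    demo_rate = demo_gb / demo_s      # 33.5 GB/s sustained during the demo
    io_fdx, ram_bw = 80, 44           # GB/s: quoted IO BW and RAM BW
    io_spx = io_fdx / 2               # 40 GB/s if the 80 is a full-duplex figure

    print(f"demo rate          : {demo_rate} GB/s")
    print(f"fits 40GB/s SPX IO : {demo_rate <= io_spx}")
    print(f"fits 44GB/s RAM BW : {demo_rate <= ram_bw}")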

So can I presume Altix 4xxx will run just a single OS instance?
As the Itanic history proves, SGI will surely jump to something more palatable... Gainestown is just around the corner, and Nehalem has native NUMA support.

I intended to mention FPGAs earlier... :) They are great for HW-parallelized tasks: DSP, image processing, etc. I'm working right now on an FPGA project for a client.

ps: SGI made great stuff, a pity they dumped MIPS for Itanic.
 
ossie,


> also used for other purposes, but it's HDD IO subsystem would
> be barely used as effectively in most other cases. ...

On the contrary, one could use the same hw for other tasks no
problem, though I should imagine for a different task it might
be the case that a different block size and allocation unit
size would be preferable - depends on the task - but at the hw
level it's the same kit. How the RAID is set up is just a matter
of configuration with diskalign and other tools, and one could
use the same hw setup as multiple different XLVs.


> ... I presume it's cost was a quite high part of that of the whole system.

Very likely! 😀


> ... I still think the performance would've been really hurt if
> the _whole_ bitmap at it's native resolution would've been

I think you're missing something: what matters is what is being
viewed on the screen, not the notional idea of the whole image
being rotated. Viewing the whole image means in reality what
you're looking at are clip-mapped mipmap sections displayed via
the 3D gfx, so yes it is real-time. That's how ELT works. As you
zoom in, it seamlessly switches to different detail
levels.


> ... help, but the operation isn't at all easy parallelized, ...

I would have thought the opposite, splitting the image up is
parallelizable by default.


> ... and most processors would have to access the address space
> of others, with the implied penalties.

Just remember the penalties are not that great. The page migration
hardware and directory-based cache coherency helps a lot here.


> From this respect, NUMA is way better than clustering, even
> with Infiniband's RDMA - high BW, but still somewhat high
> latency at ~1us MPI.

Latency in clusters is a thousand times slower. 😀 Avg. remote
latency for a 128 CPU system of 945ns sounds pretty darned fast
to me. Like I say though, this was improved a lot for O3K, with
the worst case 1024-CPU remote latency being only 2X that of
local memory.



> With the latest revision 2.0 of SAS, routed expanders are
> included, which will get SAS in competition with FC...

I'm surprised it hasn't in a big way already.


> Why would you think PR wouldn't use the FDX figure? Higher
> numbers are always better... in their view. The initial FC HBA

These are technical papers, not pure PR. B.s.'ing in that way
doesn't go down too well with the target audience. 😀


> ... The 67GB in 2s demo means 33.5GB/s, that would fit easily
> in 40GB/s SPX BW.

I'm not sure which number you're querying. The paper suggests
the bottleneck was the memory bw, not the I/O. Plus, it doesn't
specifically say they configured a best-possible I/O system,
merely that I/O can scale up to 82GB/sec.


> So can I presume Altix 4xxx will run just a single OS instance?

Yes. The time taken to port the relevant bits from IRIX to Linux
is why the single-image scalability is still only 2048 CPUs or
somesuch. The original plan for Origin4000 with MIPS/IRIX was
37500 CPUs, but that never happened. All the effort went into
beefing up Linux so it could do what IRIX was capable of.



> As the Itanic history proves it, SGI'll surely jump to
> something more palatable... Gainestown is just around the
> corner, and Nehalem has native NUMA support.

Assuming Nehalem does as well as expected, and the NUMA works
for large-scale systems, I expect they'll switch over. IA64
has a lot of cache RAM, which is expensive, whereas i7 doesn't
need it to perform at the same level.



> I intended to mention FPGAs earlier... :) They are great for HW
> parallelized tasks: DSP, image processing etc. I'm just working
> right now on a FPGA project for a client.

Cool! 8) I know a movie company who had Inferno burned to an FPGA
board years ago, for an Onyx2 system. Some bits ran 100X faster
than normal. They charged $14000/hour for clients to use the
system. 😀


> ps: SGI made great stuff, a pity they dumped MIPS for Itanic.

Indeed. If they'd been able to hold on to the CPU design talent,
and stuck to their guns, things might have been different. Alas,
it must have been hard for even the biggest company fan at the
time to justify turning down a 100% salary increase to their
families just because they'd rather work at SGI instead of Intel.
The guy I knew, who was on the R10K CPU design team, held out
for quite a while, but eventually left; salaries for even menial
jobs were $85/hour.

Ian.

PS. Hmm, reckon we should take this to email? 😀 Getting a tad
off-topic really...

 
Email you mean? Sure, it's mapesdhs@yahoo.com

Btw, one thing I forgot to mention - the 945ns latency in O2K is
the worst-case scenario for remote CPU/RAM access. The vast majority
of the time, remote RAM accesses will involve latencies nowhere
near this bad. However, worst case in O2K was still an unpleasant
chunk higher than a local node access, so O3K halved the worst-
case penalty even though the scalability is an order of magnitude
higher.

More details here:

http://www.sgidepot.co.uk/origin/isca.pdf

Ian.

 
Of course this is all just sequential read/write performance. Most
users would benefit a lot more from good random read/write speed.
Random read is good on most SSDs, though it does vary. However,
random write speed on SSDs is often pretty woeful. Don't be fooled
by just a sequential read speed result. Check the article on
Anandtech for full details.

Ian.

 
Sorry but it doesn't work. The Vertex drive still has a very
poor random write speed. Check the Anandtech results.

The Vertex isn't too bad for random read speed and random
latency (slower than the Intel drives, though better than all
the other competition), but the Intel drives are just way ahead
for random write speed (10X better than the Vertex, with the
Vertex not much better than all the others).

Anandtech found the manufacturers are gearing their firmware
far too much towards obtaining high sequential read speed
numbers that are good for nothing except marketing. Once OCZ
had redone the firmware the results were much better, but
the Intel drives are still way ahead.

Ian.

 
So has anyone figured out which RAID controller works best for high-drive-count SSD arrays yet?

I noticed a few SAS SSDs on the market; these would eliminate the SATA STP overhead on SAS controllers, theoretically making them equal to (or marginally better than) SATA SSDs on SATA-only controllers?

I noticed the Samsung PB22-J 64GB SSD (albeit SATA, MLC) is priced low enough to be comparable to a second-hand 15krpm SAS 2.5" drive on eBay - so I'm tempted to grab a bunch this summer with a SATA card and compare against my 2.5" array. The only problem is finding the time to do it 🙁
 
This article about the video is incorrect in so many ways! Very bad research and writing! The video does show the RAID controller used and the configuration for all of the test conditions!

Here is a thread I started with the relevant screen shots taken directly from the video so the article researchers here don't actually have to do any research.

http://forums.macrumors.com/showthread.php?t=733657

Terrible level of incompetence here guys! Come on! This isn't up to snuff for a "Tom's Hardware" article!
 
What if they were to grab a single x16 RAID card and throw it into an open x16 slot that isn't being used (with only one video card)? Shouldn't that give them the bandwidth they need on its own?


For instance, I'm using an Intel x4 controller in a free x4 PCI-E slot on my ASUS Rampage III Gene board, and I've hit a cap of 640MB/s. Theoretically, I could simply upgrade the RAID card to an x16 one, throw it into an open x16 slot, and get the full potential out of my 8 Intel SSDs.
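A very rough way to see what the slot upgrade might buy (Python; this assumes the 640MB/s cap really is the x4 slot's lanes and that throughput scales linearly with lane count, which a chipset-attached slot may not do):

    # Naive scaling from the observed cap; real gains depend on where the
    # bottleneck actually is (slot, chipset uplink, or the controller itself).
    observed_cap = 640.0          # MB/s seen on the x4 slot (from the post)
    per_lane = observed_cap / 4
    for lanes in (4, 8, 16):
        print(f"x{lanes}: ~{per_lane * lanes:.0f} MB/s")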
 
I am not at all impressed by this performance. In fact it's pathetic. I build data loggers as part of my job. In my last build I used 1 RAID controller and 8 15K RPM Seagate drives and was able to hit well over 1,600MB/s. That is over 200MB/s per drive. I could get their same rate of 2,000MB/s (2GB/s) with only 10 drives. The fact that they need 24 is ridiculous.

I have a system with 32 15K RPM drives and that thing caps at 4GB/s due to bandwidth on the PCI-E bus, not the drives. So this is not at all compelling, considering I can buy the 10 15K RPM drives for $4,500 while the 24 SSDs will cost you $12,480. What a waste of time. With 24 drives only getting 2GB/s, that's only 91.5MB/s per drive. PATHETIC! My 7200RPM Seagate drive gets 130MB/s real-world performance.
 
1) Your 15k drives are SCSI and I'm betting that you're using an iSCSI offload controller in a SAN or server board.

This test is completely different: it involves squeezing max performance out of a desktop board with desktop drives. Genius.
 