The increasing computational and communication demands of the scientific and industrial communities require a clear understanding of the performance trade-offs involved in multi-core computing platforms. Such analysis can help application and toolkit developers design better, topology-aware communication primitives suited to the needs of various high-end computing applications. We take on the challenge of designing and implementing a portable intra-node communication framework for streaming computing and evaluate its performance on popular multi-core architectures developed by Intel, AMD, and Sun.
Our analysis shows that no single communication method achieves the best performance across all message sizes. The basic memory-based copy method performs well for small messages, vector loads and stores provide better performance for medium-sized messages, and a kernel-based approach achieves the best performance for large messages.
A multi-core processor is a processing system composed of two or more independent cores; as the number of cores increases, aggregate performance generally increases as well. A many-core processor is one in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient.
Nowadays, multi-core processors are widely used for high-performance computing. As more and more users run their distributed workloads within a single node or a small set of nodes, even a small communication delay inside the node can degrade performance. Efficient intra-node communication, and hardware support for accelerating it, has therefore become critical to obtaining optimal application performance.
Inter-node data exchange typically relies on interconnects such as Myrinet or InfiniBand. Myrinet has lower protocol overhead than Ethernet, so it provides better performance, less interference, and lower latency. It uses two optical fibers per link, one for upstream and one for downstream traffic, connecting each node to a switch; it has a latency of 5.5 microseconds, but its hardware and software costs are high. InfiniBand is a high-speed interface for servers and peripherals: it handles connections between servers as well as connections to peripherals such as storage switches. Its main disadvantage is software overhead.
Here we describe the approach used to evaluate the communication mechanisms of various multi-core architectures. For the basic memory-based copy, we use the standard memcpy function to copy data from the user buffer to the shared communication buffer, and then back to the user buffer on the receiving side. This form of data transfer takes paths 3 to 8.
Intel Streaming SIMD Extensions 2 (SSE2) is Intel's SIMD (single instruction, multiple data) instruction set: a single instruction performs the same operation on a set of different data elements. SSE2 includes two vector move instructions, movdqa and movdqu, each of which transfers 16 bytes of data from a source buffer to a destination buffer.
- Movdqa - This instruction requires a 16-byte-aligned memory location.
- Movdqu - This instruction works with either aligned or unaligned memory.
These transfers follow paths 3 to 8. SSE2 also includes a set of streaming instructions that do not pollute the cache hierarchy; the streaming instructions follow paths 1 and 2. There are two streaming store instructions in the SSE2 instruction set.
- Movntdq - This is a non-temporal store that copies data from the source address to the destination address without polluting the cache lines. If the data are already in cache, the cache lines are updated. This instruction copies 16 bytes of aligned data at a time.
- Movnti - This is a non-temporal store similar to movntdq, except that it copies only 4 bytes at a time.
The standard memory copy approaches involve two copies to transfer data from the source to the destination buffer. For large messages we opt for a single-copy approach with kernel-level memory copies using the LiMIC2 library, which abstracts the memory transfers into a few user-level functions.