References

[1] A. Averbuch, E. Gabber, B. Gordissky, and Y. Medan, A parallel FFT on a MIMD machine, Parallel
[2] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series,
[4] M. Heidemann, D. Johnson, and C. S. Burrus, Gauss and the history of the FFT, IEEE Mag., 1, Oct.
[6] T. Lewis and H. El-Rewini, Parallax: A tool for parallel program scheduling, IEEE Parallel and
[7] F. H. McMahon, The Livermore fortran kernels: A computer test of the numerical performance range,
[8] D. Mitchell et al., Inside The Transputer, Blackwell Scientific Publications, Melbourne, 1990.
[9] L. Ni et al., A survey of wormhole routing techniques in direct networks, Computer, pages 62-76,
[10] J. K. Ousterhout, Tcl and the Tk Toolkit, Addison-Wesley Professional Computing Series, Sydney, 1994.
[11] A. Symons, V. L. Narasimhan, and K. Sterzl, Performance analysis of a parallel FFT algorithm on a
[12] A. Symons and V. Lakshmi Narasimhan, Parsim, a message passing computer simulator, In
[13] O. Tanir and S. Sevinc, Defining requirements for a standard simulation environment, IEEE Computer,
[14] M. Y. Wu and D. D. Gajski, Hypertool: A programming aid for message-passing systems, IEEE
[15] T. Yang and A. Gerasoulis, Pyrros: Static task scheduling and code generation for message-passing
CiTR Technical Journal, Volume 2
Head, Information Management Group, Information Technology Division, DSTO
Email: Lakshmi.Narasimhan@dsto.defence.gov.au
Table 1: Standard Model

Level 0   Host Language.
Level 1   Model Specification. The model abstraction.
Level 2   Knowledge Management. How are the models interconnected?
Level 3   System Design. Data gathering.
The Design and Application of Parsim: A message PAssing computeR SIMulator

Anthony Symons and V. Lakshmi Narasimhan

Currently many interconnection networks and parallel algorithms exist for message-passing computers. Users of these machines wish to determine which message-passing computer is best for a given job, and how it will scale with the number of processors and the algorithm size. This paper describes a general-purpose simulator for message-passing multiprocessors (Parsim), which facilitates system modelling. A structured method for simulator design has been used, which gives Parsim the ability to easily simulate different topology and algorithm combinations. This is illustrated by applying Parsim to a number of algorithms on a variety of topologies. Parsim is then used to predict the performance of the new IBM SP2 parallel computer, with topologies ranging up to 1024 processors.

Keywords: Parallel Distributed Computing, Simulation, Hypercube, IBM SP2, Transport Optimisation, Transputer Mesh, Performance Parameters.

IEE Proceedings: Digital Technology, Vol. 144, No. 1, January 1997.
Introduction
Recently we have seen the introduction of a number of large message-passing computers, with the new IBM SP2 being able to scale to hundreds of processors. Examples of other message-passing parallel computers are the Transputer hypercube [8], the Transputer mesh, and a cluster of workstations connected via an ethernet.

One problem facing the user is whether or not a particular parallel processing computer will suit the application, i.e., what will the performance be, and would it be profitable to migrate the application to the new system.

To answer this question of predicting the performance of message-passing computers, we see the need for a simulation tool with the following features:
- separation of the algorithm and hardware simulation
- easy re-configuration of the hardware
- ease of modifying various system parameters
- allow a study of various routing methods and interconnection networks
- provide interaction between communication processes and computation processes
- capability to simulate thousands of processors
- allow different processor speeds.
A number of simulation tools for parallel systems are available, such as Parallax [6], Pyrros [15] and Hypertool [14]. Hypertool is a tool for scheduling a program on a hypercube. The program to be analysed must first be written (in C) before analysis is performed. This tool does not allow a flexible topology, and the need for a working program is seen as a disadvantage, as we do not want to spend time writing a program which may never be executed.

Parallax improves on these two points: the topology is extended from a hypercube to a general interconnection network, and the program to be scheduled is input as a task graph. However, Parallax was designed primarily for comparing differing scheduling methods and heuristics. In a similar manner, Pyrros is focused on providing more complex scheduling methods.

Whilst these tools do provide some of the features we desire in a simulation tool, not all of the desiderata are provided, such as the interaction between communication and computation processes, different processor speeds, and the ability to simulate thousands of processors. Therefore, to satisfy the needs of a simulation tool, in this paper we address the design and implementation issues of our message PAssing computeR SIMulator, Parsim.
The rest of the paper is organized as follows: Section 2 outlines the goals and implementation of Parsim; sections 2.3 and 3 provide details of the topologies and algorithms simulated, respectively, with the results discussed in section 4. Section 5 provides pointers for future work in this area.
Parsim Design
In the design of a simulator, we have the choice between an event-driven simulator and a time-driven simulator. Event-driven simulators order the execution of the events in the simulation in an a priori fashion. This has the advantage that only events are simulated, i.e., the simulator can jump to the next event in the queue. However, in a message-passing computer there are interactions between computation and communication events which can affect both the duration and the ordering of the events. To avoid this problem, we use a time-driven simulator. A global clock is used to step through each event, analogous to the processor clock.

As Parsim is a time-driven simulator, the objects in the simulation each have a number of associated states. While in a processing simulation state, depending on whether or not the processor has a separate communications processor, not all of the processor's power may be available for computation. For example, each active link on the transputer generates a 5% load on the processor, whereas for many workstations with no specific communications processor, computation and communication cannot be overlapped.
To reflect these possibilities, each simulated processor has a value indicating the percentage of processing power that is available for that particular clock step, i.e., proc_avail. Each link attached to the processor may
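A sketch of this bookkeeping follows, using the figures quoted in the text (5% processor load per active transputer link; no overlap at all without a communications processor). Only the name proc_avail comes from the paper; the rest is our own illustration, not Parsim's actual code.

```cpp
#include <cmath>

// Fraction of processing power available for computation in one clock step.
// Assumptions (from the text): each active link costs 5% of the CPU on a
// transputer; a machine with no separate communications processor cannot
// overlap computation with communication at all.
double proc_avail(int active_links, bool has_comms_processor, bool communicating) {
    if (!has_comms_processor)
        return communicating ? 0.0 : 1.0;   // no overlap possible
    double avail = 1.0 - 0.05 * active_links; // 5% load per active link
    return avail < 0.0 ? 0.0 : avail;
}
```

A transputer with four active links would thus retain roughly 80% of its power for computation during that step, while a plain workstation contributes nothing while a message is in flight.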
Level 1: Model Specification
Level 1 of the reference model is the model specification, which defines how the algorithms and applications are simulated and specified. Tanir abstracts the model specification as an object, a class MODEL.
We generalise the algorithms to be simulated as being composed of a group of phases executed in sequential order. Each phase is subdivided into a set of like tasks. For example, the one-dimensional Fast Fourier Transform (FFT) [1] consists of two phases, namely the row FFTs and the column FFTs. The first phase can be broken down into a number of single-row FFT tasks and, similarly, the second phase can be broken down into a number of single-column FFT tasks. The phases are executed by a simple sequential pseudo code: run the tasks of each phase, synchronise, and proceed to the next phase.
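The original pseudo-code listing is not reproduced here; a sketch of the phase-sequential execution it describes, under our own naming (not Parsim's), is:

```cpp
#include <vector>

// A phase is a set of like, independent tasks; phase p+1 may not start
// until every task of phase p has completed (the inter-phase synchronisation).
struct Phase { int num_tasks; double task_time; };

// Returns the simulated completion time: tasks within a phase run P at a
// time, so each phase takes ceil(num_tasks / P) rounds of task_time.
double run_phases(const std::vector<Phase>& phases, int processors) {
    double clock = 0.0;
    for (const Phase& ph : phases) {
        int rounds = (ph.num_tasks + processors - 1) / processors;
        clock += rounds * ph.task_time;   // barrier before the next phase
    }
    return clock;
}
```

For the two-phase FFT example, eight row tasks followed by eight column tasks on four processors take two rounds per phase.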
In Parsim, we provide a C++ class, specific, which extends the MODEL class to allow the algorithms to be simulated. This class consists of the following elements:
Computation
- start-up computation
- local processing
- execution time for each task
- end computation

Communication
- message size sent to the slave processors
- message size sent to the host processors
- receive communication
- send communication

Synchronisation
- synchronisation between phases

Algorithm Data Structures
- number of tasks per phase
- number of phases
- specific data structure.
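A minimal sketch of what such a class hierarchy might look like, assuming hypothetical member names derived from the element list above (the paper's actual declarations are not shown in this copy):

```cpp
#include <vector>

// Hypothetical base class standing in for the MODEL abstraction (Level 1).
class MODEL {
public:
    virtual ~MODEL() {}
    virtual int num_phases() const = 0;
    virtual int tasks_per_phase(int phase) const = 0;
};

// Illustrative 'specific' subclass mirroring the element list in the text;
// member names are ours, not Parsim's.
class specific : public MODEL {
public:
    // Computation
    double startup_computation = 0.0;
    double local_processing = 0.0;
    std::vector<double> task_execution_time;   // one entry per task
    double end_computation = 0.0;
    // Communication
    int msg_size_to_slaves = 0;
    int msg_size_to_host = 0;
    // Synchronisation: barrier between phases
    bool sync_between_phases = true;
    // Algorithm data structures
    std::vector<int> tasks_in_phase;           // number of tasks per phase

    int num_phases() const override { return (int)tasks_in_phase.size(); }
    int tasks_per_phase(int p) const override { return tasks_in_phase[p]; }
};
```

The point of the split is that the simulator core only ever sees the MODEL interface, while each algorithm fills in a specific subclass.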
(GWll`^^GioUTUT`:st-Level 2: Hardware and Topology Specification
Level 2 of the standard model provides the model interconnections. In this section, we consider the interconnection of processors, communication nodes, links, and the message routing. To allow flexibility, parameters specifying the hardware configuration to be simulated are read in via input files. The parameters required to model the hardware include:
- Link latency and start-up times.
- Speeds of each CPU relative to a base CPU.
- Topology interconnection and routing information.
The first two parameters can simply be read in as a file of floating-point numbers, which can be easily changed to reflect faster hardware and processing speed. However, more emphasis is necessary on the representation of the interconnection network and the message routing information.
Figure 1: Message-passing Parallel Computer Configuration
The general processor interconnection is shown in Figure 1, where a number of processors are attached to an interconnection network (ICN). For multi-stage interconnection networks, such as the IBM SP2, processors are only connected to communication nodes (termed nodes). An example of these configurations is Figure 2a, where the ICN consists of nodes and links.
Figure 2: Processor-Processor Interconnection. a Multi-stage Interconnection; b Processor-Only Interaction; c Added Nodes
However, some topologies have processors only connected to other processors, such as the hypercube and mesh topologies. In this case, the ICN consists only of links to other processors; an example is shown in Figure 2b.

To provide homogeneity between these two interconnection methods, we transform the processor-only topologies into processor-node topologies by inserting a communication node between each processor-processor connection. This modification is shown in Figure 2c.

This model with inserted communication nodes has the following advantages over processor-processor-only interconnection:
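The transformation just described can be sketched as follows; this is an illustrative helper of our own, not Parsim's code, with node ids following the negative-numbering convention used for the node information files:

```cpp
#include <vector>
#include <utility>

using Edge = std::pair<int, int>;  // (endpoint a, endpoint b)

// Replace every processor-processor edge with a processor-node-processor
// pair of edges, creating one new communication node per original link.
// Processors are numbered 0..P-1; new nodes are numbered -1, -2, ..., -N.
// Returns the processor-to-node edges and the number of nodes created.
std::pair<std::vector<Edge>, int>
insert_comm_nodes(const std::vector<Edge>& proc_edges) {
    std::vector<Edge> out;
    int next_node = -1;
    for (const Edge& e : proc_edges) {
        out.push_back({e.first,  next_node});  // processor -> new node
        out.push_back({e.second, next_node});  // processor -> new node
        --next_node;
    }
    return {out, -(next_node + 1)};            // N nodes created
}
```

A two-link chain 0-1-2, for instance, becomes 0 and 1 attached to node -1, and 1 and 2 attached to node -2.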
- Allows a consistent model of processor-node connections.
- Allows unidirectional and bi-directional communication to be regulated, e.g., using a node to provide the ethernet interconnection allows the constraint of only one processor sending a message to be applied.
The topology information is broken into two files: one for the processors and another for the communication nodes. If we have P processors and N nodes, then the processor information file will consist of P entries, one per processor.
The node information file will consist of N entries of a similar form. As a node can be connected both to processors and to other nodes, we represent the ids of the processors as the numbers 0 ... P - 1 and the node ids as the numbers -1 ... -N.
Route Information
Each of the processor and node entries in the configuration input files contains route information necessary to route messages from one processor/node to the destination processor. The routing for intermediate processors/nodes consists of selecting the correct send link, and then routing the message onward. The selection of the correct link to route the message can be achieved via a number of methods, such as:
- dynamic route determination algorithm
- static route determination algorithm
- static route table.
Route determination algorithms, required in the first two methods, may be dependent on the topology simulated and would need to be hand-coded into the simulator, thus reducing Parsim's flexibility in simulating different routing strategies. To overcome this limitation, a static route table is used.
Each processor/node i has a route table which consists of a vector (P entries) of link sets. At position j in the vector, the link set is the set of links that i may use to communicate with processor j. Each link set is represented by an integer. A link set is created by numbering the links 0 ... MAX LINKS and setting bit k = 1 iff link k can be used for processor/node i to route messages to processor j.
For topologies such as the hypercube and mesh, each link set will have at most one non-zero bit. However, this is not the case in general. This implies that there can be a choice of links for routing purposes.
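The link-set representation above maps directly onto bit operations; the following is an illustrative helper of our own, not Parsim's code:

```cpp
#include <vector>

// Static route table for one processor/node: entry j is an integer link set
// in which bit k is 1 iff link k may carry traffic towards processor j.
struct RouteTable {
    std::vector<unsigned> link_set;   // P entries, one per destination

    void allow(int dest, int link)        { link_set[dest] |= 1u << link; }
    bool can_use(int dest, int link) const { return (link_set[dest] >> link) & 1u; }

    // Lowest-numbered usable link; a real simulator could instead choose
    // among several set bits to balance load across alternative routes.
    int pick_link(int dest) const {
        for (int k = 0; k < 32; ++k)
            if (can_use(dest, k)) return k;
        return -1;                    // no route to this destination
    }
};
```

For a hypercube or mesh each entry has a single set bit, so pick_link is deterministic; richer topologies leave several bits set and hence a routing choice.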
For large numbers of processors, it can be seen that it would be impractical to generate the processor and node information files manually. To aid in the development of new topologies, a number of tools are provided, namely Linkup, Cube, Meshx, Meshy, Meshstep, Ethernet and Frame, which help develop the node and processor configuration files.
- Linkup allows the user to specify directly the processor-to-processor connections and routings.
Level 3: System Design

The data gathered by the simulator allows comparisons between algorithms and topologies, including:

- Processor utilisation: this indicates the efficiency of the parallel algorithm.
- Communication bottlenecks: the simulator can be used to pinpoint the presence of bottlenecks, and to compare the effects of new routing methods or algorithms on the communication.
Level 4: Application GUI
To aid the developer, a Graphical User Interface (GUI) based on Tcl/Tk [10] has been added to the front end of Parsim. Screen dumps of Parsim are provided which outline the following functions.
The algorithm to be simulated and the directories for the configuration files are selected in Figure 3.
Figure 3: Configuration Menu for Parsim
An example graphical output of Parsim is shown in Figure 4.
Figure 4: Graphical Output of Parsim
(G8le @[9cbbSef `dsTest Algorithms Analysed
Three test algorithms are analysed using the simulator: a parallel FFT, a set of Livermore Loop kernels, and the Transport Optimisation problem.

These algorithms can be parallelised in a variety of ways, and their suitability for message-passing computers will also vary. Let us consider a typical algorithm on a message-passing computer in order to determine features of the algorithm that are favourable for parallel execution.
Assumptions:

- The data size is represented by x and the number of processors by P.
- The data originally resides on one processor, termed the host, and the results must also reside on this processor.
- We use an equal distribution of data over the processors, i.e., each processor receives data of size x/P.
Each processor will communicate a set of messages with the host; let the total communication time of these sets of messages not overlapped by computation be as in equation 1, where C0 is the fixed start-up time and C is the communication time proportional to the data size.

    tc = C0 + C(x/P)    (1)

As there are P - 1 processors communicating with the host, the total communication time from equation 1 is therefore equation 2.

    Tc = (P - 1)(C0 + C(x/P))    (2)
Define the execution time of the algorithm on a single processor, for data of size x, as in equation 3.

    T(1) = X(x)    (3)
As the parallel algorithm must do the work of the serial algorithm, the time to compute just the algorithm on P processors is therefore X(x)/P. Thus combining equations 2 and 3 gives the total execution time of the parallel algorithm T(P) as in equation 4.

    T(P) = X(x)/P + (P - 1)(C0 + C(x/P))    (4)
With X(x)/k - X(x)/P termed the scaled reduction in work, the following theorems have been proved in [12].

Theorem 1

The scaled reduction in work using the parallel algorithm must be greater than the total communication time in order to obtain a speedup of k.
Theorem 2

If the execution time of any parallel algorithm is denoted as the function X(x), and the inverse of this function is denoted X^-1(x), then for large P, the minimum size of data x to achieve a speedup of k is given by equation 5.
Theorem 3

For small values of C0, the minimum size of data x to achieve a speedup of k can be given in closed form, where the complexity of the algorithm is given as X(x) = X_N x^N and N > 1.
Let us also consider the special case of the algorithm complexity being linear with x, i.e., X(x) = X'x. Substitution into equation 4 gives equation 6.

    T(P) = X'x/P + (P - 1)(C0 + C(x/P))    (6)
(Ged""), a!V o"UTUT` ZimFFT
The idea behind the Fast Fourier Transform (FFT) calculation may be attributed to [4], but the first widely known FFT algorithm is that of Cooley and Tukey [2]. It features in many applications ranging from image processing to speech recognition.
Real-time processing, however, is severely constrained by the speed of general-purpose uniprocessor systems and, as a consequence, a proliferation of dedicated DSP hardware has now been observed. These dedicated systems still perform sequential calculations, and this has motivated the investigation of a number of parallel versions of the FFT.
In order to determine the performance of the FFT executing on various transputer topologies, a parallel FFT algorithm based on a two-stage one-dimensional transform [11] is simulated by Parsim. The input sequence is mapped onto a two-dimensional matrix of R rows and C columns. Essentially the algorithm consists of two stages:
- Stage 1: Apply the FFT over the rows of the data.
- Stage 2: Apply the FFT over the columns of the data.
It is important to note the time complexities of the row and column FFTs for an equal distribution of data: each of the P processors performs R/P row transforms of length C in the first stage, and C/P column transforms of length R in the second.
Livermore Loop Kernels
The Livermore loop kernels are a common way to measure the performance of parallel systems [7]. Only a subset of the kernels, those applicable to message-passing networks, have been simulated using Parsim. Each of the loops is blocked, i.e., the data is divided into contiguous blocks which are distributed over the network so that each processor receives an equal amount. The block size is given by the number of iterations divided by the number of processors. These kernels are characterized for a 30 MHz T805 Transputer and a Sun Sparcstation-2 by the parameters shown in Table 2.
- Kernel number 1 is a fragment of hydrodynamics code.
- Kernel number 2 is a fragment of incomplete Cholesky conjugate gradient.
- Kernel number 3 is the inner product.
- Kernel number 7 is an equation of state fragment.
- Kernel number 9 is a code fragment calculating the integrate predictors.
- Kernel number 12 calculates the first difference.
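As an illustration of the blocking scheme, consider kernel 12 (the first difference, one flop per iteration) distributed in contiguous blocks. The helper below is our own sketch of the distribution, not Parsim's code:

```cpp
#include <vector>

// Kernel 12, blocked: x[k] = y[k+1] - y[k]. Each of P processors gets a
// contiguous block of (iterations / P) elements, as described in the text.
std::vector<double> first_difference_block(const std::vector<double>& y,
                                           int proc, int P) {
    int n = (int)y.size() - 1;        // total iterations of the kernel
    int block = n / P;                // block size = iterations / processors
    std::vector<double> x(block);
    for (int k = 0; k < block; ++k) {
        int g = proc * block + k;     // global iteration index for this block
        x[k] = y[g + 1] - y[g];
    }
    return x;
}
```

Each processor works only on its own contiguous slice, which is what keeps the per-kernel communication down to the few blocks listed in Table 2.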
Transport Optimisation
The problem of scheduling transportation routes and vehicles optimally is gaining more and more interest due to the large amounts of capital invested in the vehicles and the payroll of the drivers. To solve this transport optimisation problem, integer programming and dual simplex methods have been proposed. However, solutions to practical problems involving 1400 variables can take a few hours on a workstation.
Therefore much interest has been displayed in whether the optimisation problem can be parallelised effectively.
To answer this question, it is proposed that the optimisation algorithm be simulated using Parsim. The algorithm can be outlined as follows:
- Choose a constraint branch from the initial basis.
- Construct an infeasible basis for each of the 0 and 1 branches, termed decision nodes.
- Apply the dual simplex to each node to generate two feasible bases.
- Repeat until an integer solution is found.
Initially, there is one basis, but with each application of the dual simplex algorithm, two more nodes are produced in the decision tree.
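The growth of the decision tree can be sketched with a simple count; this is purely illustrative (no actual dual simplex is performed):

```cpp
// Open-node count of the decision tree: each dual-simplex application takes
// one node off the work queue and produces its two children (the 0 and 1
// branches), so the number of open nodes grows by one per application.
int nodes_after(int applications) {
    int open = 1;                     // the initial basis
    for (int i = 0; i < applications; ++i)
        open = open - 1 + 2;          // expand one node into two children
    return open;
}
```

This linear growth in available nodes is what gives the simulated parallel version its supply of independent work for the processors.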
The statistics collected for the execution of a Transportation Optimisation problem on a Sun Sparcstation are the following:

- The number of variables in the basis is 1421.
- The time to generate a successful sub-goal solution by the dual simplex iteration has been found to be a random variable with a range of 0 to 4 minutes.
- The number of decision nodes to be evaluated before reaching the optimal integer solution is 40.
Simulation Results
The actual and simulated performance of the parallel FFT are shown in Figure 5. It is seen that the simulated performance is in close agreement with the actual performance. The variation is due to the simulator clock starting at zero, whereas on the transputer there are some bookkeeping overheads which mean that the clock does not start at time zero.
Anthony Symons and V. Lakshmi Narasimhan
Table 2: System Parameters

Loop   Flops/      Mflop Single   Mflop Single   Iterations   Blocks     Blocks
       Iteration   Transputer     Workstation                 Received   Sent
1      5           0.943 x 10^6   1.315 x 10^6   10,000       1          1
2      4           0.724 x 10^6   0.355 x 10^6   10,000       1          2
3      2           1.548 x 10^6   2.922 x 10^6   10,000       0          2
7      16          1.416 x 10^6   2.148 x 10^6   10,000       1          3
9      17          0.357 x 10^6   0.574 x 10^6   10,000       1          13
12     1           1.855 x 10^6   1.315 x 10^6   10,000       1          1
The communication and computation times are modelled as follows, where C0 is the message start-up time, C is the per-element communication time, x is the problem size in iterations and P is the number of processors:

Message Set Communication Time = C0 + C x      (1)

Total Communication = (P - 1) (C0 + C x)      (2)

Computation Time = X(x)      (3)

For a speedup factor of k on P processors, the problem size must satisfy

x > X^-1( k P C0 (P - 1) / (P - k) )      (5)

and, writing the computation time per iteration as X' (so that X(x) = X' x) and including the per-element communication cost, the condition becomes

x > k P C0 (P - 1) / ( X' (P - k) - k C (P - 1) )      (6)

where the denominator X' (P - k) - k C (P - 1) must be positive.
Figure 6: Simulated Livermore Loop Performance on a Transputer Hypercube (100,000 Iterations)
Corollary 1
To achieve a speedup for any particular Livermore Loop, equation 7 must be satisfied, where Lk is the transfer rate of the communication links in floating point numbers per second, Nl is the number of communication links active per processor, FI is the floating point operations per iteration, B is the total number of blocks sent/received per processor, and MR is the Mflop rating of a single processor:

Lk Nl / MR > B / FI      (7)
Proof:
Using equation 6 and noting that the speedup factor k is 1 gives equation 8:

x > P C0 / (X' - C)      (8)

As we let C0 → 0, the necessary condition is X' - C > 0. X' is the rate of work of the Livermore Loop, which can be calculated as FI/MR. Similarly, C is the communication rate, which can be calculated as the ratio of floating point numbers communicated to the speed of the communication links, i.e., B/(Lk Nl). Substituting into equation 8 gives equation 7.
This condition can be interpreted as follows: to increase the overall speedup, the physical machine's ratio of transfer rate to processing rate must be greater than the algorithm's per-iteration ratio of communication to computation.
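As a quick numeric illustration of equation 8, the break-even problem size can be evaluated directly. The start-up time C0 and the processor count below are assumed values for illustration, not measurements from the paper:

```python
def break_even_iterations(P, C0, x_prime, C):
    """Equation 8: for any speedup (k = 1) the problem size must exceed
    P*C0 / (X' - C); there is no speedup at all unless X' - C > 0."""
    if x_prime - C <= 0:
        return None              # communication outweighs computation
    return P * C0 / (x_prime - C)

# Loop 1 of Table 2 on a two-transputer system:
FI, MR = 5, 0.943e6              # flops per iteration, processor flop rate
B, Lk, Nl = 2, 0.446e6, 1        # blocks per iteration, link rate, active links
C0 = 1e-5                        # assumed 10 us message start-up time
x_min = break_even_iterations(P=2, C0=C0, x_prime=FI / MR, C=B / (Lk * Nl))
```

Because X' - C is positive for loop 1, a finite problem size recovers the start-up cost; for a loop with X' - C <= 0 no problem size gives a speedup.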
The effect of this condition can be examined as follows. Let the number of active links be 1. The transputer's link transfer rate is 0.446 x 10^6 float/s, and the computation rate is approximately 1 Mflop. Therefore the ratio of blocks sent/received to floating point operations per iteration must be less than 0.446. From Table 2 it is seen that this is only true for loop numbers 1 and 7.
By increasing the transfer rate of the links ten-fold (which may reflect the use of faster communication processors), the ratio of blocks sent/received to floating point operations per iteration for a speedup with two processors is increased to 4.46. The simulation of the high performance links is shown in Figure 7. As can be seen, all loops now provide a speedup with two processors, as predicted by the higher allowed communication-to-computation ratio.
Figure 7: Simulated Livermore Loop Performance on a Transputer Hypercube (High Performance Links)
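The check against equation 7 can be reproduced directly, taking FI and B per loop from Table 2 (B counting blocks sent plus received):

```python
# loop number -> (FI = flops per iteration, B = blocks sent + received)
loops = {1: (5, 2), 2: (4, 3), 3: (2, 2), 7: (16, 4), 9: (17, 14), 12: (1, 2)}

Lk = 0.446e6      # transputer link transfer rate, float/s
Nl = 1            # active links per processor
MR = 1.0e6        # single-transputer rate, approximately 1 Mflop

def speeds_up(FI, B, link_rate=Lk):
    """Equation 7: Lk*Nl/MR > B/FI."""
    return (link_rate * Nl) / MR > B / FI

qualifying = sorted(n for n, (FI, B) in loops.items() if speeds_up(FI, B))
# -> [1, 7]: only loops 1 and 7 beat the 0.446 threshold

faster = sorted(n for n, (FI, B) in loops.items()
                if speeds_up(FI, B, link_rate=10 * Lk))
# -> all six loops qualify once the links are ten times faster
```

This reproduces both observations above: with standard links only loops 1 and 7 satisfy the condition, and with ten-fold faster links every loop does.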
The following five figures show the system performance of an 8 by 5 transputer mesh executing the first Livermore loop. Figure 8a shows the total time spent in calculations by the processors. This shows that each processor performs the same amount of computation due to the equal distribution of data over the network. Figure 8b shows the amount of time spent by the links on each processor while being blocked. As there are no peaks and the middle is flat, it is concluded that no particular processor is blocked from communicating an excessive amount.
Figure 8: 8 x 5 Transputer Mesh Using Step Routing for the First Livermore Loop
(Gesount of ppse T;UT (ofm Link Communication Time of 8 5 Transputer Mesh for the First Livermore Loopa Step ea9UT@e NRouting; b Row First (X) Routing; c Column First (Y) Routing.
Figure 9 (parts a, b and c) compares the link communication time for the three mesh routings. The step routing (Figure 9a) spreads the communication more evenly than the x routing (Figure 9b) or y routing (Figure 9c). In both the x and y routings, most communication is along the first row or column, thus creating a bottleneck.
To see why this is the case, consider the row routing shown in Figure 10a and the step routing shown in Figure 10b. It is seen that for the row route case, most of the processors communicate via the one link.
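The concentration of traffic can be illustrated by counting per-link loads on the 8 x 5 mesh. The traffic pattern (every processor sending one message to the corner processor) and the alternating-step rule below are assumptions made for illustration, not Parsim's actual router:

```python
from collections import Counter

ROWS, COLS = 5, 8   # the 8 x 5 mesh, addressed as (column, row)

def row_first_path(src):
    """Row-first (X) routing: travel along the row to column 0, then up."""
    x, y = src
    path = []
    while x > 0:
        path.append(((x, y), (x - 1, y)))
        x -= 1
    while y > 0:
        path.append(((x, y), (x, y - 1)))
        y -= 1
    return path

def step_path(src):
    """Assumed step routing: alternate one X hop and one Y hop toward (0, 0)."""
    x, y = src
    path = []
    while x > 0 or y > 0:
        if x > 0:
            path.append(((x, y), (x - 1, y)))
            x -= 1
        if y > 0:
            path.append(((x, y), (x, y - 1)))
            y -= 1
    return path

def max_link_load(route):
    """Heaviest link when every processor sends one message to (0, 0)."""
    links = Counter()
    for x in range(COLS):
        for y in range(ROWS):
            for link in route((x, y)):
                links[link] += 1
    return max(links.values())
```

Under row-first routing every message funnels through the few links next to the corner, so its maximum link load is higher than under the alternating rule, which spreads the same traffic over a diagonal band of links.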
The topologies simulated are the mesh, hypercube and SP2 MIN using transputer communication link values. This figure shows that there is little variation in the performances of each topology, with a maximum variation of 3%.
Figure 12: Topology Comparison for 4,00 Decision Nodes.
Conclusions
In this paper, we have introduced a tool to help predict the performance of message passing parallel systems. This simulator facilitates re-configuration in order to allow a variety of tasks and topologies to be simulated. A high degree of simulator accuracy is shown in the comparison between actual and simulated FFT performance.
[11] A. Symons, V. L. Narasimhan, and K. Sterzl, Performance analysis of a parallel FFT algorithm on a transputer network, Parallel Algorithms and Applications, 4, 1994.
[12] A. Symons and V. Lakshmi Narasimhan, Parsim: message passing computer simulator, In Proceedings of the First ICA3PP-95 Conference, pages 621-630, April 1995.
[13] O. Tanir and S. Sevinc, Defining requirements for a standard simulation environment, IEEE Computer, 27(2):28-34, February 1994.
[14] M. Y. Wu and D. D. Gajski, Hypertool: A programming aid for message-passing systems, IEEE Transactions on Parallel and Distributed Systems, 1(3):101-119, July 1990.
[15] T. Yang and A. Gersoulis, Pyrros: Static task scheduling and code generation for message-passing multiprocessors, In Proceedings 6th ACM International Conference on Supercomputing, pages 428-443, New York, 1992. ACM Press.
Table 1: Standard Model

for each phase {
    start up computation on host
    synchronise phase start
    for each task {
        send communication
        local processing
        receive communication
    }
    synchronise phase end
    phase end computation on host
}