

Mark Jawad
Senior Software Engineer
Software Development Support Group

DEVELOPERS CONFERENCE



# **Topics**

- Review of DS system architecture
- Role of the ARM7
- Role of the ARM9
- ARM9 arch. review
- Code Gen review
- Bus review
- Caches and TCM
- Implications of what we've learned so far
- Rules of THUMB

- Main Memory Display Mode (+DMA)
- Card DMA
- Interrupt processing and best-practices
- Fast data uploading during V-Blank
- Asynchronous processing

Nintendo Confidential




#### Point out:

ARM9 + associated pieces

ARM7 + attached peripherals

Work Ram, and how it's attached





ARM7TDMI is the same kind of chip used in the GBA; makes sense given the GBA-compatible mode that DS has.

This is how we achieve hardware-level compatibility with the GBA.

Note that on a Von Neumann machine, there's only 1 s

# Operation of ARM7

- Basic program operation
  - Creates a bunch of interrupt handlers
  - Drops into an idle loop
  - Interrupt handlers take care of most needs
    - PXI requests, external device updates, etc
    - Some requests are too heavy-weight to do in an interrupt handler. These are sent to threads.
  - Threads handle "long-term" tasks
    - For audio, wireless, and other complex needs
    - Thread schedules may cause delays


## Notes on ARM7

- Doesn't have a cache
- Does have onboard RAM
  - Has 64K of internal Work RAM
  - Reserves all of the ARM7 / ARM9 "shared" Work Ram space for exclusive use (32K)
- Has priority access to main RAM!
  - Due mainly to Wireless and Audio needs
  - This will impact your game
    - But the amount depends on features used




## ARM7 memory access patterns

- Most code / data lives in ARM7 dedicated memory and shared work ram
  - Not in main memory
  - Very few code fetches from main memory
  - Data transmissions to/from main memory are occasional
    - Auto-sampled data (TP, MIC)
    - Audio data (samples, etc)
- Wireless complicates the picture a little


Need to discuss bus access pattern for all parts of the system here.

Audio data is transferred to the FIFO by sound DMA: 32 bytes on the first transfer, then 16 bytes on each transfer thereafter.

For the microphone and touch panel, one or two u16 words are written at a time during sampling.

The ARM7 also goes out to RAM to write things like the X/Y button values and the RTC.

# **ARM7 SDK Components**

- "Mongoose" component
  - Wireless code / data must be fetched from Main Memory\*
  - So more traffic to main mem when wireless is active
    - Remember, no cache on ARM7
- "Ichneumon" component
  - Wireless code / data is fetched from VRAM\*
    - But locks VRAM banks C, D



- Actually, the connection setup and teardown code is fetched from main memory.
- The code that operates while the connection is established is within the dedicated work RAM + shared work RAM.

#### Role of the ARM9

The "main processor"

- Reserved for your game code
- Mostly under your control
  - Some SDK pieces may be doing things you weren't aware of...
    - Threading
    - TCM use
    - Blocking on PXI command results
  - ...but we give you the SDK source, so you can have a complete understanding




### **ARM9 Architecture Review**

- See slides from previous DevConf for more detailed info on features, instructions
- Things worth noting:
  - DSP instructions can be beneficial, but are only accessible via ARM assembly code
  - Put data into local vars wherever possible
  - PLD (data preload) instruction is *ignored*
  - Instruction Cache preload is supported




- Compiler places literal pools after each function
  - Any function that needs data other than the incoming parameters
- What's a literal?
  - Roughly: any data larger than a byte
    - Which means pretty much everything
    - Some literals can be placed in the actual instruction opcode (rare)
    - Most things (u32 and smaller) are stored in the literal pool
    - For larger data, lit pool contains the address of the data
  - Also: Addresses of any static or global variables


Lit pools mean more time on the bus (2x from normal code-only/data-only lines)

- Implications
  - There's at least 1 cache line with code + data overlap
  - No L2 cache means that the I-cache will need to load the line from RAM, and the D-cache will also load the line from RAM.
  - Can be detrimental to performance-critical functions




- The more globals / statics accessed, the larger the literal pool
  - And you can see where that leads: program bloat
- Loading a global, static, et al takes 2 loads
  - The 1st load is PC-relative and gets the address of the global from the literal pool.
  - The 2nd load actually retrieves the data.
- Pack static or global variables into a structure to avoid the hit
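
A minimal sketch of that last point, assuming a typical ARM compiler. With three separate globals, each access usually needs its own literal-pool entry holding that variable's address. With one structure, the compiler can load a single base address and reach every field by offset. All names here (`g_score`, `GameState`, and so on) are hypothetical.

```c
#include <stdint.h>

/* Separate globals: each one typically costs a literal-pool entry
   (its address) plus two loads per access. Hypothetical names. */
int32_t g_score;
int32_t g_lives;
int32_t g_level;

int32_t total_separate(void)
{
    return g_score + g_lives + g_level;
}

/* Same data packed into one structure: one base address, then each
   field is a small offset from it. */
typedef struct GameState {
    int32_t score;
    int32_t lives;
    int32_t level;
} GameState;

GameState g_state;

int32_t total_packed(void)
{
    const GameState *s = &g_state;   /* one address load from the pool */
    return s->score + s->lives + s->level;
}
```

Disassemble both functions with your own toolchain to confirm the literal pool actually shrinks; optimizers differ.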






Literal pool in this case points to 3 different / unique strings. So, can't compact it via a single structure..

Well, you could, but you have to go out of your way...

- Many ARM opcodes can have constant data of 8 significant bits or less
  - Bypasses the literal pool
  - Doesn't necessarily mean that you're limited to a single byte
    - 0xff
    - 0x0e10
    - 0x07000000
- Compiler always creates "complex" literal instead of burning 2-3 insns to generate it
  - i.e. 0x07000400
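
To check whether a constant can live inside the opcode, here is a small sketch of the ARM data-processing immediate rule: an 8-bit value rotated right by an even amount. The function name `is_arm_immediate` is made up for this example; it is not an SDK or compiler API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n in 0..31). */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    return n == 0 ? v : (v << n) | (v >> (32u - n));
}

/* True if v fits an ARM data-processing immediate:
   an 8-bit value rotated right by an even amount (0, 2, ..., 30). */
bool is_arm_immediate(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by 'rot' and see if only 8 bits remain. */
        if (rotl32(v, rot) <= 0xFFu) {
            return true;
        }
    }
    return false;
}
```

With this check, 0xff, 0x0e10 and 0x07000000 are accepted, while 0x07000400 is rejected because its set bits span more than 8 positions.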




#### Table 2-1 : Memory Configuration and Specifications

| Memory Type | Bus Width (bits) | Access Cycles | DMA Read Width (bits) | DMA Write Width (bits) | Main Processor Read Width (bits) | Main Processor Write Width (bits) |
|-------------|------------------|---------------|-----------------------|------------------------|----------------------------------|-----------------------------------|
| DS Accessory RAM<br>(SRAM, flash memory, etc.) | 8            | 6-18                   | -                                      | -     | 8                                                 | 8       |
| DS Accessory ROM (ROM, flash memory, etc.)     | 16           | 1st 6-18<br>2nd 4-6    | 16/32                                  | 16/32 | 8/16/32                                           | 16/32   |
| OAM                                            | 32           | 1                      | 16/32                                  | 16/32 | 8/16/32                                           | 16/32   |
| VRAM                                           | 16           | 1                      | 16/32                                  | 16/32 | 8/16/32                                           | 16/32   |
| Palette RAM                                    | 16           | 1                      | 16/32                                  | 16/32 | 8/16/32                                           | 16/32   |
| I/O Registers                                  | 32           | 1                      | 16/32                                  | 16/32 | 8/16/32                                           | 8/16/32 |
| Internal Work RAM                              | 32           | 1                      | 16/32                                  | 16/32 | 8/16/32                                           | 8/16/32 |
| Main Memory                                    | 16           | 1st R:5 / W:4<br>2nd 1 | 16/32                                  | 16/32 | 8/16/32                                           | 8/16/32 |
| System ROM                                     | 32           | 1                      | -                                      | -     | 8/16/32                                           | -       |
| TCM/Cache                                      | 32           | 1/2                    | -                                      | -     | 8/16/32                                           | 8/16/32 |

The values given for the number of access cycles correspond to a bus frequency of 33.514 MHz.

Furthermore, these values are for when memory is accessed in a bit width that is equal to or less than the bus width. When memory is accessed in a bit width that is larger than the bus width, the number of access cycles is limited to the bit width divided by the bus width.



- Most important data moves via 16-bit bus
  - Think about that for a second
- IO register space is 32-bit
  - Graphics chip settings, 3d commands
  - Sprite data (OAM) access
  - Game Card access
- DMA incurs same penalties as CPU
  - But benefits from "burst mode"


Burst Mode of DMA gives it an advantage for bulk data transfers, and is described in the NITRO Programming Manual.

- Not shown: VRAM contention
  - If a bank is being used by the graphics engines and you try to access it with CPU, a stall occurs on ARM9
  - Reverse: Causes graphics to flicker
  - How much of a stall?
    - Core clock: dot clock == 6:1
    - ARM9 clock : dot clock == 12:1
    - Most 2d data access takes multiple dot clocks




Obviously, if a bank isn't mapped to ARM9 then there's no contention..

Bottom line: don't access VRAM / graphics registers outside of v-blank

### **Bus Arbitration**

- What happens when ARM9 and ARM7 try to access main memory at the same time?
  - ARM7 wins (due to EXMEM priority setting)
  - Once a client has the bus no one can interrupt them
- How long does a client lock the bus?
  - Duration of transaction!
  - So don't do huge transfers; you stall all other clients!


Problems caused by large bus transactions are mostly noticeable when wireless is on or MIC sampling is turned on at a high frequency.

Odds are good that you'll see little problem with large transactions for many single player games. But your audio use can impact that...



### **ARM9 Cache review**

- 8KB I-Cache
- 4KB D-Cache
  - Warning: Additional (active) threads eat up space due to stack activity
- 32byte cache lines
  - You can lock down a chunk of RAM
    - 1k increments, 32byte aligned
  - Tricky to use it correctly
    - Usage puts pressure on existing lines
    - You need to make sure that the code or data is linked with the correct alignments
- Line must be filled before execution can resume




As noted earlier, you can only pre-load the instruction cache.

Also, the main thread stack is located in DTCM

## **ARM9 TCM review**

- Tightly Coupled Memory
  - 32KB ITCM
    - Can store code+data
  - 16KB DTCM
    - Data only; never seems to be enough of it
  - TCM access does not use the bus!
- TCM is not dynamically allocatable
  - Well, tools don't support it nicely
  - But you can build your own support anyways
    - And you should. It's not too hard to do.
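
As a rough idea of what "build your own support" could look like, here is a minimal bump allocator over a fixed pool. It assumes you have arranged (via your linker settings) for the pool to be placed in DTCM; the pool size and every name below are hypothetical, and this is a sketch, not how the SDK does it.

```c
#include <stddef.h>
#include <stdint.h>

#define DTCM_POOL_SIZE (4u * 1024u)  /* hypothetical slice of the 16KB DTCM */

/* Stand-in for a region that the linker would place in DTCM. */
static uint32_t s_pool[DTCM_POOL_SIZE / 4];
static size_t   s_used;              /* bytes handed out so far */

/* Hand out 'bytes' from the pool, rounded up to 4-byte alignment.
   Returns NULL when the pool cannot satisfy the request. */
void *tcm_alloc(size_t bytes)
{
    size_t rounded = (bytes + 3u) & ~(size_t)3u;
    if (rounded < bytes || rounded > DTCM_POOL_SIZE - s_used) {
        return NULL;                 /* overflow or pool exhausted */
    }
    void *p = (uint8_t *)s_pool + s_used;
    s_used += rounded;
    return p;
}

/* Release everything at once, e.g. at a level or scene boundary. */
void tcm_reset(void)
{
    s_used = 0;
}
```

A bump allocator with a single reset point is about the simplest scheme that works; it fits data whose lifetime matches a level or a frame.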


Main thread stack is allocated in DTCM by default

# A Quick Summary..

- Bus contention for Main Ram or VRAM can stall the ARM9, slowing down your game
- No need for the bus:
  - TCM access
  - In-cache code and data access
- Needs the bus:
  - Main Memory access
  - VRAM access
  - Some memory-mapped I/O registers




## ..and some numbers

- Bus speed: 33.514MHz
- ARM9 speed: 67.028MHz
- Bus Width ↔ Main Memory: 16bits
- ARM9 Cache line: 32 bytes
- ARM instructions: 32bits each
- THUMB instructions: 16bits each
- D-Cache: 1024 32-bit words total
  - Some of it lost to literal pool & code overlap
  - And stacks for additional threads





- ARM9 goes 40+ cycles in the time that it takes the bus to fill an entire cache line (i.e., 20 bus cycles)
  - Then gets to execute (AT MOST) 8 or 16 instructions before needing another cache line
- If that cache line isn't available, then the process repeats

### Run the numbers

- On I-Cache miss:
  - ARM: 40+8 cycles of time for 8 instructions
  - THUMB: 40+16 cycles of time for 16 instructions
- D-Cache miss is similar
  - 40 cycles later, you get your data
  - Hope you wanted more than 1 word from the cache line
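
The numbers can be reproduced with a small model. It assumes a line fill costs one 5-cycle first read plus 15 one-cycle sequential reads over the 16-bit bus (the Main Memory row of Table 2-1), two ARM9 cycles per bus cycle, and one cycle per instruction once the line has arrived. The function names are made up for this sketch.

```c
#include <stdint.h>

enum {
    CACHE_LINE_BYTES   = 32,
    BUS_WIDTH_BYTES    = 2,   /* 16-bit bus to main memory */
    FIRST_READ_CYCLES  = 5,   /* main memory, 1st access (read) */
    NEXT_READ_CYCLES   = 1,   /* main memory, sequential accesses */
    ARM9_PER_BUS_CLOCK = 2    /* 67.028MHz / 33.514MHz */
};

/* Bus cycles needed to fill one cache line from main memory. */
uint32_t line_fill_bus_cycles(void)
{
    uint32_t transfers = CACHE_LINE_BYTES / BUS_WIDTH_BYTES;   /* 16 */
    return FIRST_READ_CYCLES + (transfers - 1) * NEXT_READ_CYCLES;
}

/* Worst-case ARM9 cycles when every cache line misses.
   bytes_per_insn is 4 for ARM code and 2 for THUMB code. */
uint32_t worst_case_cycles(uint32_t instructions, uint32_t bytes_per_insn)
{
    uint32_t per_line = CACHE_LINE_BYTES / bytes_per_insn;
    uint32_t lines    = (instructions + per_line - 1) / per_line;
    uint32_t fill     = line_fill_bus_cycles() * ARM9_PER_BUS_CLOCK;
    return lines * fill + instructions;
}
```

Under these assumptions one line of ARM code costs 48 cycles for 8 instructions and one line of THUMB code costs 56 cycles for 16 instructions; 80 instructions come out at 480 cycles (ARM) and 5 lines x 56 = 280 cycles (THUMB).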


#### ARM:

- 8/48 = 1/6 ≈ 16.7% efficiency
- 80 instructions could take up to 480 cycles when the I-cache always misses

#### THUMB:

- 16/56 = 2/7 ≈ 28.6% efficiency
- 80 instructions could take up to 280 cycles (5 cache lines × 56 cycles) when the I-cache always misses

One could argue that ARM code is 1.5x (or more) as efficient as THUMB, but THUMB makes up for it by virtue of cache wins. It might end up being a tie, but we're thinking that THUMB is a slight win overall.

# **Implications**

- Your game is memory bound
- This is why Cache and TCM are so incredibly important
  - And why all the interrupt handlers are in TCM
- Locality is key to performance








- Use 16bit THUMB instructions
  - As fast as 32bit instructions, but ½ the size
    - Crams more code into the Instruction Cache
  - Switching ARM↔THUMB is free ("blx" insn)
  - May cause literal pool to be slightly larger
  - But great if you are tight on RAM
    - And who isn't?
  - Easy to do:
    - `#include <nitro/code_16.h>`


- Functions that aren't called often ought to be in THUMB mode anyway
  - Initialization code is usually large. THUMB it!
- Functions that are simple should be in THUMB mode too
  - Most "getter" functions are suitable
  - Simple Boolean tests, too
- Disassemble your code in both ARM and THUMB and choose the version that gives best size/cycle tradeoff


If there is more than one bit operation (mask, shift, insert, extract) then ARM code is usually the winner.

But for cases where only simple loads, stores, or comparisons are done, THUMB usually needs 0 to 3 more instructions and still ends up at roughly half the size of the ARM code.

- Remember when I said that THUMB instructions were just as fast as ARM instructions?
  - Not exactly true.. there's interlock involved with most of them which causes some stalls on each cycle
  - But you're totally stalled by the bus most of the time anyways so it doesn't really matter
  - And you get 2x THUMB code per cache line, so realistically you're getting more work done
- SDK defaults to 100% ARM code (for both processors) unless you specify otherwise




#### Some downsides

- Fewer registers available means more traffic to the stack
  - Most instructions can only access r0-r7
  - Limited instructions for access of r8-r15
- No win for branch-heavy calls
  - Jumping to other functions usually takes 2 16bit instructions back-to-back
    - so 32bits per branch not a win






# Main Memory Display Mode

- Possible Use:
  - Can use it while 2d/3d is being captured to VRAM
  - Generate data on the fly or show a static screen
  - Manually post-process a captured image
- Notes:
  - Once the mode is active, you MUST keep feeding it new data
  - Otherwise, it uses last data in the FIFO (ugly!)
  - Takes more hand holding than VRAM display mode




Anybody using this? Would love some feedback on that.

#### Why use it?

1. Frees up VRAM for other purposes
2. You can be drawing the screen just ahead of the DMA read stream

#### Why not use it?

1. Can't execute other auto-start DMAs

## Main Memory Display Mode

- Uses AutoStart signal from LCDC
  - Data copy happens when the pixel display FIFO has room for another 4 words
  - Transfers 4 u32's, then goes idle (off bus)
  - Still Enabled, though, so channel is "locked"
- DMA completes when you hit V-Blank
  - You'll need to re-start the DMA for each upcoming frame
  - Do re-start during V-Blank
- Can't use other auto-start DMA modes
  - Immediate mode DMA ok, though
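
To get a feel for how often the channel wakes up, here is a back-of-the-envelope sketch. The screen size (256x192) and 16-bit pixels are assumptions not stated on the slide; the 4-word burst is from the slide. The function names are made up for this example.

```c
#include <stdint.h>

enum {
    SCREEN_WIDTH    = 256,   /* assumed LCD width in pixels */
    SCREEN_HEIGHT   = 192,   /* assumed LCD height in pixels */
    BYTES_PER_PIXEL = 2,     /* assumed 16-bit pixels */
    BURST_WORDS     = 4,     /* u32 words moved per auto-start request */
    BURST_BYTES     = BURST_WORDS * 4
};

/* Bytes the DMA must deliver for one full frame. */
uint32_t frame_bytes(void)
{
    return (uint32_t)SCREEN_WIDTH * SCREEN_HEIGHT * BYTES_PER_PIXEL;
}

/* Auto-start requests needed per scanline. */
uint32_t bursts_per_line(void)
{
    return (uint32_t)SCREEN_WIDTH * BYTES_PER_PIXEL / BURST_BYTES;
}

/* Auto-start requests needed per frame. */
uint32_t bursts_per_frame(void)
{
    return frame_bytes() / BURST_BYTES;
}
```

Under these assumptions a frame is 96KB of pixel data, delivered as 32 small bursts per scanline, which is why the channel stays "locked" for the whole visible period.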


It's not necessary to pre-draw the whole screen before starting the transfer, but make sure not to let it grab garbage.

Roughly: Transfer activity happens around time that line is being drawn to LCD; goes idle during H-Blank

#### Card DMA

- Card DMA functions similarly to Main Memory Display DMA
  - DMA channel is enabled for duration of entire transfer
  - Card AutoStart signal tells it to do some work
    - Only one u32 is transferred at a time!
    - Then goes idle (off the bus)
  - The cycle repeats until transfer is complete
    - Once all data is transferred, DMA is set to Disabled


Remember: the SDK usually checks a DMA channel before attempting to use it.

If the channel is marked as "Enabled", the SDK will stall until it goes to "Disabled".

So bad things happen if (for example) your FS and GX are set to use the same DMA channel.

Technically, the auto-start only happens on a per-card-page basis. Our SDK sorta hides this from you to make life easier.

#### Rules for Async Card read

- Follow the CARD\_ReadRomAsync guidelines
  - Make sure that all data is aligned to 512byte boundaries on the card (.rsf can specify this)
  - Read multiples of 512bytes
  - Target destination must be 32byte aligned
- Use DMA channel 3
  - Use higher channel numbers for GX & WM
- Cache Notes
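
The alignment rules above are easy to check before issuing a read. A minimal sketch, assuming you validate in your own wrapper; `card_async_params_ok` is a hypothetical helper, not an SDK function, and the SDK's own checks are not shown here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    CARD_PAGE_SIZE  = 512,  /* card-side alignment and read granularity */
    CACHE_LINE_SIZE = 32    /* required destination alignment */
};

/* True if a read follows the three rules listed above.
   dst_addr is the destination address as an integer. */
bool card_async_params_ok(uint32_t rom_offset, uintptr_t dst_addr, size_t len)
{
    if (rom_offset % CARD_PAGE_SIZE != 0) {
        return false;       /* source not 512-byte aligned on the card */
    }
    if (len % CARD_PAGE_SIZE != 0) {
        return false;       /* length not a multiple of 512 bytes */
    }
    if (dst_addr % CACHE_LINE_SIZE != 0) {
        return false;       /* destination not 32-byte aligned */
    }
    return true;
}
```

Asserting on a check like this in debug builds catches misaligned assets early, instead of discovering later that a read was slower than expected.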


Async is best if you can pre-load data before you actually need to display it.

Pre-load your splash screens while the first one is being displayed.

Pre-load your menu while the final splash screen is being displayed.

Take a best-guess at what level the player will be playing, and pre-load as much as you can *before* the user requests to move from your menus into the game.

This is the only way to get that GBA-like quality of "instantaneous" level loads

Cache notes: upcoming patch to SDK 4.x will make large data loads happen faster, due to invalidating the entire cache rather than one line at a time.

#### Interrupt processing

- Get in, then get out
  - If possible, just set some flags and deal with it later
  - Sometimes you do have to do actual work, but make it quick
- This is an embedded device
  - Not a lot of time to burn; cycles are precious
  - Try to put the callbacks in ITCM
- You will destabilize the system if you take too long in a callback or interrupt handler!
  - SDSG has seen WAY too many cases of "heavy lifting" taking place within an interrupt handler
  - And we've seen the chaos it causes.
  - We thought you ought to know...
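
A minimal sketch of the "set some flags and deal with it later" pattern. The event names and handler functions are hypothetical; on hardware the handlers would be registered with the OS, and the read-and-clear in the main loop would be done with interrupts masked.

```c
#include <stdint.h>

/* Hypothetical event bits, one per interrupt source. */
enum { EVENT_TIMER = 1 << 0, EVENT_CARD = 1 << 1 };

/* Written by interrupt handlers, consumed by the main loop. */
static volatile uint32_t s_pending;

/* Counters standing in for the real (heavy) work. */
uint32_t g_timer_jobs;
uint32_t g_card_jobs;

/* Interrupt side: record the event and get out. */
void on_timer_irq(void) { s_pending |= EVENT_TIMER; }
void on_card_irq(void)  { s_pending |= EVENT_CARD; }

/* Main-loop side: do the heavy lifting outside interrupt context.
   Returns the mask of events that were handled. */
uint32_t process_pending_events(void)
{
    /* On real hardware, mask interrupts around these two lines so a
       handler cannot slip in between the read and the clear. */
    uint32_t events = s_pending;
    s_pending = 0;

    if (events & EVENT_TIMER) {
        g_timer_jobs++;     /* placeholder for the deferred work */
    }
    if (events & EVENT_CARD) {
        g_card_jobs++;      /* placeholder for the deferred work */
    }
    return events;
}
```

Call `process_pending_events()` once per pass through the main loop; the handlers themselves stay a few instructions long.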




#### V-Blank Handler

60Hz (during production)

```
#include <nitro/itcm_begin.h>

NITRODEVCAPS _devCaps;

static void handle_vbl_intr(void)
{
   if (_devCaps.m_dwMaskResource & NITROMASK_RESOURCE_VBLANK)
      NITROToolAPIVBlankInterrupt();

   // set the flag saying that we've
   // dealt with the interrupt.
   OS_SetIrqCheckFlag(OS_IE_V_BLANK);
}
#include <nitro/itcm_end.h>
```

#### V-Blank Handler

60Hz (finalrom)

```
#include <nitro/itcm_begin.h>

static void handle_vbl_intr(void)
{
   // set the flag saying that we've
   // dealt with the interrupt.
   OS_SetIrqCheckFlag(OS_IE_V_BLANK);
}
#include <nitro/itcm_end.h>
```



#### Fast data upload for V-Blank

- Don't wait until V-Blank to decide what data to upload
- Determine all necessary info ahead of time
  - Function that you'll use to do the upload
  - Source and Destination Addresses
  - Byte count to xfer, VRAM offset
- Put this info in a list or queue
- Then run through the list when V-Blank hits and dispatch it all


Actually, you might want one list for 3d items, and then another list for everything else. This is because the 3d V-Blank window is quite small, and happens at the beginning of the V-Blank period.

#### Why?

- Will simplify game logic
- Will simplify V-Blank processing
- Makes the most of V-Blank time

#### Fast data upload for V-Blank

```
// can only call during V-Blank
GX_LoadOBJPltt(pObjPltt, 0, 32);

// want ability to call this anytime during the frame
defer(GX_LoadOBJPltt, pObjPltt, 0, 32);

// so main loop looks something like..
while(true)
{
   // ...
   OS_WaitVBlankIntr();
   dispatch_deferred_functions_then_clear();
   // ...
}
```

### Fast data upload for V-Blank

This effectively gets us:

- But you can dynamically have many different calls
- And they can vary on each frame
- Gives us immense flexibility


## Notes on the Deferred Function Handler

- Can call any number of functions
  - Limited by RAM
  - FIFO call order
- Each call can have between 0-4 params

  - e.g. 3 params: `GX_LoadOBJPltt(pObjPltt, 0, 32);`
- System provides temporary storage space in case you need to record a value in time
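
For readers who want the idea without the assembly, here is a portable C sketch of a deferred-call queue. It is a simplification, not the handler described in this talk: every deferred function takes four word-sized arguments and ignores the ones it doesn't need, and the queue has a fixed, hypothetical capacity. `defer4`, `fake_upload` and the log variables are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEFERRED 64  /* hypothetical cap; the real limit is available RAM */

/* Every deferred function takes four word-sized arguments; callers
   that need fewer simply ignore the extras. */
typedef void (*DeferredFn)(uintptr_t a0, uintptr_t a1,
                           uintptr_t a2, uintptr_t a3);

typedef struct {
    DeferredFn fn;
    uintptr_t  args[4];
} DeferredCall;

static DeferredCall s_queue[MAX_DEFERRED];
static size_t       s_count;

/* Record a call for later. Returns 0 on success, -1 if the queue is full. */
int defer4(DeferredFn fn, uintptr_t a0, uintptr_t a1,
           uintptr_t a2, uintptr_t a3)
{
    assert(fn != NULL);
    if (s_count == MAX_DEFERRED) {
        return -1;
    }
    DeferredCall *c = &s_queue[s_count++];
    c->fn = fn;
    c->args[0] = a0;
    c->args[1] = a1;
    c->args[2] = a2;
    c->args[3] = a3;
    return 0;
}

/* Run every recorded call in FIFO order, then empty the queue. */
void dispatch_deferred_functions_then_clear(void)
{
    for (size_t i = 0; i < s_count; i++) {
        const DeferredCall *c = &s_queue[i];
        c->fn(c->args[0], c->args[1], c->args[2], c->args[3]);
    }
    s_count = 0;
}

/* Stand-in for an upload routine: just logs the offset it was given. */
uintptr_t g_upload_log[8];
size_t    g_upload_count;

void fake_upload(uintptr_t src, uintptr_t offset,
                 uintptr_t size, uintptr_t unused)
{
    (void)src; (void)size; (void)unused;
    if (g_upload_count < 8) {
        g_upload_log[g_upload_count++] = offset;
    }
}
```

Game logic calls `defer4(...)` whenever it decides an upload is needed, and the main loop calls `dispatch_deferred_functions_then_clear()` right after waiting for V-Blank. A real upload routine would need a thin wrapper with the four-argument signature.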







### Fast data upload for V-Blank

(code)

```
// 1) load packed control word + functionAddr
// 2) then extract num args + bytes to skip from control word
ldmia curPtr!, {bytesToSkip,procAddr}
mov argsToLoad, bytesToSkip, LSL #28
mov bytesToSkip, bytesToSkip, LSR #4
// argsToLoad got converted to condition bits.. move em into CPSR_FLAGS
msr CPSR_F, argsToLoad
// now load the appropriate set of registers based on the condition bits.
// order is: n,z,c,v
ldmvsia curPtr!, {r0} // <cond>vs == v flag only
ldmcsia curPtr!, {r0-r1} // <cond>cs == c flag only
ldmeqia curPtr!, {r0-r2} // <cond>eq == z flag only
ldmmiia curPtr!, {r0-r3} // <cond>mi == n flag only
add curPtr, curPtr, bytesToSkip // add any left over bytes
blx procAddr // make the call
```
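
The packed control word is easier to follow in C. This sketch assumes the encoding implied by the listing: the low 4 bits become the N, Z, C, V flags after the `LSL #28`, exactly one of them (or none) is set, and the remaining bits are the byte count to skip. The type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned arg_count;      /* how many of r0..r3 get loaded */
    uint32_t bytes_to_skip;  /* extra payload following the arguments */
} ControlWord;

/* Build a control word; arg_count must be 0..4. */
uint32_t encode_control_word(unsigned arg_count, uint32_t bytes_to_skip)
{
    /* 1 arg -> V (bit 0), 2 -> C (bit 1), 3 -> Z (bit 2), 4 -> N (bit 3) */
    uint32_t nibble = arg_count ? (1u << (arg_count - 1)) : 0u;
    return (bytes_to_skip << 4) | nibble;
}

/* Mirror of the two 'mov' instructions plus the conditional loads. */
ControlWord decode_control_word(uint32_t word)
{
    ControlWord cw;
    uint32_t flags = word << 28;      /* low 4 bits -> N,Z,C,V (bits 31..28) */
    cw.bytes_to_skip = word >> 4;

    if (flags & (1u << 31)) {
        cw.arg_count = 4;             /* mi: N set, loads r0-r3 */
    } else if (flags & (1u << 30)) {
        cw.arg_count = 3;             /* eq: Z set, loads r0-r2 */
    } else if (flags & (1u << 29)) {
        cw.arg_count = 2;             /* cs: C set, loads r0-r1 */
    } else if (flags & (1u << 28)) {
        cw.arg_count = 1;             /* vs: V set, loads r0 */
    } else {
        cw.arg_count = 0;             /* no flags: nothing to load */
    }
    return cw;
}
```

Storing the argument count as a one-hot flag pattern is what lets the assembly pick the right `ldm` without a single branch.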

### Fast data upload for V-Blank

(code)

```
void GX_LoadOBJPltt(const void *pSrc, u32 offset, u32 szByte)
{
    SDK_NULL_ASSERT(pSrc);
    SDK_ASSERT(offset + szByte <= HW_OBJ_PLTT_SIZE);
    SDK_ALIGN2_ASSERT(offset);
    SDK_ALIGN2_ASSERT(szByte);
    GXi_DmaCopy16(GXi_DmaId, pSrc, (void *)(HW_OBJ_PLTT + offset),
                  szByte);
}
```

## Fast data upload via Deferred Function Calls

- No more if / switch logic at V-Blank
  - Or worse
- Maximize upload time
- "Fire and forget" graphics upload commands during game logic
- Flexible function call system
- Can be used for other things, too




#### One small problem...

(which we've mentioned before)

- Might not want to upload a huge chunk of data via single DMA
  - because it blocks all clients on the device
- Problematic when Audio and Wireless are in heavy use
- Solution: Chunk up your data uploads into smaller DMAs so that the other clients get access to Main Memory


Experiment to see what works best. Start with 1KB chunks and scale up or down from there.
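
A minimal sketch of chunking, assuming the actual transfer is hidden behind a function pointer. Here `copy_with_memcpy` stands in for whatever DMA routine you use on hardware; both it and `upload_in_chunks` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*CopyFn)(void *dst, const void *src, size_t bytes);

/* Stand-in copier for this sketch; on hardware this would kick off
   one DMA transfer and wait for it to finish. */
void copy_with_memcpy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Split one large upload into transfers of at most chunk_bytes each.
   Returns the number of transfers issued. */
size_t upload_in_chunks(void *dst, const void *src, size_t bytes,
                        size_t chunk_bytes, CopyFn copy)
{
    assert(chunk_bytes > 0 && copy != NULL);

    uint8_t       *d = dst;
    const uint8_t *s = src;
    size_t transfers = 0;

    while (bytes > 0) {
        size_t n = bytes < chunk_bytes ? bytes : chunk_bytes;
        copy(d, s, n);
        /* Between transfers the bus is free for the other clients. */
        d += n;
        s += n;
        bytes -= n;
        transfers++;
    }
    return transfers;
}
```

Making the chunk size a parameter keeps the experiment cheap: start at 1024 and adjust while watching audio and wireless behavior.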

### Async functions

- Not all functions are equally asynchronous
- Some dispatch work to ARM7, others to the DMA controller, and yet others queue work for separate threads
- Functions such as FS\_ReadFileAsync will actually fall into synchronous mode if the parameters aren't "perfect"




### Recap

- ARM7 CPU operation
- ARM9 CPU operation
  - And interaction with main memory
- Bus operation
- DMA controllers
- V-Blank handling
- And More!


### Take-away

- Come away with a better, deeper understanding of overall system design
  - And how to fit into it
- Knowledge to build engine and game code to make best use of Nintendo DS


