======Introduction======

The ProtoFlex User Guide is intended for first-time users of the ProtoFlex simulator. This guide covers the basic ProtoFlex concepts, the hardware and software installation procedures, and the process for staging and running your first simulation on the FPGA. At a minimum, this guide assumes you are familiar with computer architecture concepts and general simulation tools. No prior knowledge of FPGAs is needed (unless you intend to make modifications).

\\

======The ProtoFlex Simulator======

The ProtoFlex project is an open-source simulator developed at Carnegie Mellon University to facilitate scalable, shared-memory multiprocessor research using FPGAs. In its basic form, ProtoFlex simulates a functional model of an N-way UltraSPARC III server system and is able to run unmodified, multithreaded applications on a Solaris operating system. ProtoFlex is a parameterizable simulator and has been shown to simulate up to 16 processors on a {{http://bee2.eecs.berkeley.edu/|BEE2 FPGA}} platform. The version of the ProtoFlex simulator that you will be using has been ported to the {{http://www.xilinx.com/univ/xupv5-lx110t.htm|XUPV5-LX110T platform}}, which is a widely available commodity FPGA platform.

Throughout this guide, we use the following terminology. A **target system** refers to the simulated machine that we are interested in modeling (in the case of ProtoFlex, this is the Serengeti-based UltraSPARC III server). A **host system** refers to the underlying collection of hardware and software used to support the simulation of the target system. This includes the FPGA platform as well as software components that run on an x86-based workstation.

The **target** machine that we will be simulating on the FPGA is a functional model of a 4-CPU UltraSPARC III shared-memory server. The target application that runs on this model will be the Solaris 10 operating system.
We will also stage and run a simple multithreaded microbenchmark within the operating system.

\\

======Licensing======

The ProtoFlex simulator is released under the GNU GPLv2 license (http://www.gnu.org/licenses/gpl-2.0.txt). Please respect all terms and conditions that apply. In plain English, any modifications you make and release publicly must be accompanied by source code (and also re-released under the same licensing terms).

\\

======Release notes======

  * ProtoFlex 1.0 (9/20/09). {{:documentation:protoflex_1.0.tgz|Download}}. Initial release of the ProtoFlex simulator. Currently supported platforms: XUPv5-LX110T.

\\

======Disclaimer======

Before continuing further, it is important to understand the limitations of the platform release for XUPv5. The ProtoFlex project was originally developed and optimized for the BEE2 FPGA platform with Virtex-II Pro FPGAs. Since Xilinx no longer plans to support the Virtex-II parts in future releases of ISE/EDK, we are also discontinuing ProtoFlex on the BEE2 platform.

Due to limitations of the XUPv5 board, the ProtoFlex simulator runs in a limited-mode configuration. First, due to BRAM capacity constraints, we are currently limited to simulating only 4 CPUs. Given the long pipeline depth of our design (14 stages) and a 100MHz clock, we should not expect performance higher than a theoretical peak of roughly 28.6 MIPS (100MHz / 14 x 4). Other limitations include using a Microblaze core to facilitate hybrid simulation instead of a PowerPC core (which is normally available in Virtex-II Pro). Given the drastic differences in microarchitecture and clock rate (400MHz -> 100MHz), we do observe some negative performance impact in various workloads. Furthermore, the Microblaze and PowerPC are forced to share a single memory controller, which also has a negative impact on performance.
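The theoretical bound quoted above can be checked with a short calculation. This is only a sketch: it assumes, as the formula in the text implies, that each simulated CPU completes at most one instruction per pass through the 14-stage pipeline, and ''peak_mips'' is a hypothetical helper name rather than anything in the ProtoFlex sources.

```python
def peak_mips(clock_mhz, pipeline_depth, num_cpus):
    """Upper bound on aggregate throughput in MIPS.

    Assumption (taken from the formula in the text): every simulated CPU
    retires at most one instruction per trip through the pipeline.
    """
    return clock_mhz / pipeline_depth * num_cpus


if __name__ == "__main__":
    # XUPv5 configuration described in this guide: 100 MHz, 14 stages, 4 CPUs.
    print(round(peak_mips(100, 14, 4), 2))
```

The same helper can be used to see how the bound would move on a platform with a different clock rate or CPU count.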
Due to the extra resources needed for the Microblaze, there are very few additional FPGA resources that can be used to implement "extras", such as a cache simulator or profiler (as described in the publications). Another limitation at the moment is the lack of a hardware-based double-precision floating point unit. Although floating-point instructions are supported through hybrid software simulation on the Microblaze, FP-intensive applications will run poorly on our platform. Nevertheless, it is our intention that by providing a self-contained FPGA project for a commodity evaluation board, the designs can be easily ported and extended by others to more powerful platforms and larger FPGAs (e.g., Nallatech ACP, Convey Computer).

Finally, due to the limited amount of time taken to implement the port to XUPv5, you may encounter unexpected issues that we have overlooked. Please send email to [[protoflex@ece.cmu.edu]] if you encounter any bugs or issues with the platform. Please note: we will only support requests as long as you have set up our infrastructure using the exact requirements shown below. For example, we will not answer questions if you are using an untested distribution of Linux.

\\

======Hardware Requirements======

  * A computer (Primary) with an Intel-based processor for running SuSE Linux (we recommend at least a Core 2 Duo 2.66GHz)
  * The Primary PC's motherboard should have a PCI Express x1 slot (we have tested the GIGABYTE GA-G31M-ES2L motherboard)
  * A second computer (Secondary) that is used for programming the FPGA bitstream and monitoring the RS232 output
  * The XUPV5-LX110T FPGA Board Platform (http://www.xilinx.com/univ/xupv5-lx110t.htm)
  * 2GB Micron DDR2 SODIMM Memory (Part #: MT16HTF25664HY-667E1)
  * Xilinx USB Programming Cable
  * RS232 Cable
  * Optional: USB-to-Serial converter (necessary for machines without serial ports).
This one works for both PCs and Macs: http://www.newegg.com/Product/Product.aspx?Item=N82E16812156003

\\

======Software Requirements======

  * 32-bit openSUSE Linux 11.1-i586 (http://software.opensuse.org)
  * Virtutech Simics 3.0.22 (http://www.virtutech.com)
  * Bluespec SystemVerilog Compiler v2008.11.C
  * Xilinx ISE 10.1 w/ SP3, Xilinx EDK 10.1 w/ SP3 and latest Coregen IP update
  * Optional: Synopsys VCS Y-2006.06 (for Verilog simulations)

\\

======Installing the software======

====Installing and configuring openSUSE 11.1 on the Primary PC====

The Linux PC (hereafter referred to as Primary) should be installed using the 32-bit version of openSUSE Linux 11.1 (available from http://software.opensuse.org). The kernel installed in our setup is 2.6.27.7-9-pae:

> uname -a
> Linux linux-pwbv 2.6.27.7-9-pae #1 SMP 2008-12-04 18:10:04 +0100 i686 i686 i386 GNU/Linux

====Adding required packages====

Once you are at the openSUSE terminal, you must install a number of required packages:

    sudo zypper install gcc
    sudo zypper install gcc-c++
    sudo zypper install subversion
    sudo zypper install ncurses-devel

====Installing and configuring Virtutech Simics 3.0.22 on the Primary PC====

To install and run Simics on the Primary PC, it is necessary to acquire a FlexLM license from Virtutech (www.virtutech.com) and have it installed on a FlexLM server. Instructions for acquiring and installing an academic license can be found here: http://www.virtutech.com/academia/licensing.html. Instructions for downloading the Simics package can be found at www.simics.net. The package should be: ''package-20-3.0.22-linux.tar.gz''. To install Simics, unpack this into your home folder:

    gunzip -c package-20-3.0.22-linux.tar.gz | tar -xvf -

This should create a simics folder: ~/simics-3.0.22

Create a new file called .flexlmrc in your HOME directory (e.g., ~/.flexlmrc) and add:

    VTECH_LICENSE_FILE=

To accept the license agreement, cd to ''~/simics-3.0.22/scripts'' and type ''./start-simics''.
When you are asked to, agree to the licensing terms and type ''Yes''.

We recommend reading the {{:documentation:simics-user-guide-unix.pdf|Simics User Guide for Unix}}, following the "First Steps" guide, and familiarizing yourself with the concepts of Simics checkpoints and machine targets. Specifically, the simulated system that ProtoFlex supports is a ''Serengeti''-based server system that utilizes UltraSPARC III processors.

**WARNING: YOU MUST ABSOLUTELY USE VERSION 3.0.22. The Simics API library changes between versions, and we cannot offer any support if you choose to use an unsupported version. Our use of the Simics API library is extensive, and it is unlikely that any untested version will work.**

\\

==== Installing and configuring Bluespec System Verilog on the Primary PC ====

Acquiring the Bluespec compiler requires you to contact Bluespec, Inc. directly at http://www.bluespec.com/support/index.htm to request an academic FlexLM license. This license must be installed on your FlexLM server. You must then register on the forum at http://bluespec.com/forum, which is currently used to host the Bluespec compiler releases. Once you have unpacked the Bluespec compiler onto the Primary PC, you should double-check that your **.bashrc** file contains the following:

    export LM_LICENSE_FILE=
    export BLUESPEC_HOME=/Bluespec-2008.11.C
    export BLUESPECDIR=$BLUESPEC_HOME/lib
    export PATH=$PATH:$BLUESPEC_HOME/bin

To verify that your Bluespec compiler is ready for use, type: **bsc --help**. At the bottom, you should see something similar to:

    License BCOMP expires in 362 days.

\\

====Downloading and compiling the ProtoFlex source code to the Primary PC====

  * For EXTERNAL USERS, {{:documentation:protoflex_1.0.tgz|download the tarball}}.
Uncompress the tarball by typing:

    tar -zxvf protoflex_1.0.tgz

  * INTERNAL USERS ONLY: the most up-to-date code base can be checked out by typing:

    svn checkout --username svn://miura.ece.cmu.edu/trunk protoflex

  * We recommend placing all of the source code in a folder such as /home//protoflex. We will refer to this directory as from here on.
  * Once you have downloaded the source, you will need to add and populate a number of environment variables used in the ProtoFlex simulator within your **.bashrc** file:

\\

^ Environment variable ^ Description ^ Example ^
| PF_SIMICS | Base directory where Simics is installed | export PF_SIMICS=/home/pf_user/simics-3.0.22 |
| PF_HOME | Directory where ProtoFlex source was checked out | export PF_HOME=/home/pf_user/protoflex |
| PF_DIAG | Directory used to store SPARC diagnostics | export PF_DIAG=/home/pf_user/diags |
| PF_REG | Directory used to store regressions | export PF_REG=/home/pf_user/regress |

\\

  * Finally, you will need to add this to your .bashrc file (after the environment variables are defined):

    export PF_SUN_HOST=none
    source /settings.sh

  * Your ''.bashrc'' file up to this point should look something like this:

    export LM_LICENSE_FILE=
    ##########################
    # Bluespec
    ##########################
    export BLUESPEC_HOME=/home/pf_user/Bluespec-2008.11.C
    export BLUESPECDIR=$BLUESPEC_HOME/lib
    export PATH=$PATH:$BLUESPEC_HOME/bin
    ##########################
    # ProtoFlex
    ##########################
    export PF_HOME=/home/pf_user/protoflex
    export PF_SIMICS=/home/pf_user/simics-3.0.22
    export PF_DIAG=/home/pf_user/diags
    export PF_REG=/home/pf_user/regress
    export PF_SUN_HOST=none
    source /home/pf_user/protoflex/settings.sh

  * To build the ProtoFlex software modules (which are used to facilitate PC-to-FPGA communication), type:

    $> cd
    $> make sw

  * After observing some compilation output, you should verify that the following files have been generated:

    /apps/pfmon/bin/pfmon
    /modules/simics_remote_ctrl/simics_listener/x86-linux/lib/simics_cpu_listener.so
    /modules/simics_remote_ctrl/simics_listener/x86-linux/lib/simics_device_listener.so
    /modules/simics_remote_ctrl/simics_listener/x86-linux/lib/sparc-irq-bus.so

  * For an explanation of what each of these software modules does, please refer to the **reference guide (TBD)**.

\\

==== Installing Xilinx Software and IP ====

  * Both ISE and EDK 10.1 should be installed on both the Primary and Secondary PCs. The Primary PC requires the Linux version of the Xilinx tools, while the Secondary PC can use either the Windows or Linux version (we have tested the Windows version). The Primary PC is normally used to synthesize the RTL and generate the final FPGA bitstream. The Secondary PC is used to program and monitor the FPGA and Primary PC while they are running together.
  * **Disclaimer: we have NOT tested anything using Xilinx 11 tools. Do so at your own risk.**
  * During installation, do not forget to update to Service Pack 3 and to ALSO install the Xilinx Coregen 10.1i IP Update 3. (Visit http://www.xilinx.com/support/download/index.htm and look for it at the bottom where it says ''Download File Archive''). When you have finished installing the tools, you should place this in your .bashrc file on the Primary PC (change the paths below as needed depending on where you installed the tools):

    source /home/ise-10.1/ISE/settings32.sh
    source /home/edk-10.1/EDK/settings32.sh

  * After installation, you will need to install the ''libdb'' library using ''yast'' (otherwise Xilinx EDK will not run properly). At the command line, type ''sudo /sbin/yast2''. Under ''Software->Software Management'', search for the ''db43'' (Berkeley DB Database Library) package and install it.
After installation, type the following commands:

    cd /usr/lib
    sudo ln -s libdb-4.3.so libdb-4.1.so

  * Due to Xilinx licensing restrictions, there are certain HDL files and netlists related to the PCI express components that we cannot include in the packaged release. These files must be downloaded and generated separately and will require the appropriate IP core licenses. Fortunately, most academic groups enrolled in the Xilinx University Program (http://www.xilinx.com/univ) are eligible to receive this license for free.
  * We will start by generating the netlist and Verilog files for the PCI Express Endpoint Plus IP block. Follow the instructions at http://www.xilinx.com/univ/xupv5-lx110t/design_files/PCIe/XUPV5-LX110T_PCIe_x1_Endpoint_Plus_Design_Creation.pdf, beginning on slide 7 and ending on slide 18. **When you reach slide 15, rather than entering ''5050'' for the ''Device ID'' field, enter ''0007'' instead.** Note: if Coregen appears to have an out-of-date endpoint block (not 1.9), then you forgot to update your Coregen IP library.
  * Assuming that you created a folder called ''xupv5_pcie_x1_plus'' in the previous step, there should be a file named **''endpoint_blk_plus_v1_9.ngc''**. Copy this file to ''/platforms/edk/xupv5-1.0/pcores/pcie_ram_v2_00_a/netlist''.
  * Within the directory named ''xupv5_pcie_x1_plus/endpoint_blk_plus_v1_9/example_design'', copy the two files named **''pci_exp_1_lane_64b_ep.v''** and **''xilinx_pci_exp_ep.v''** to ''/platforms/edk/xupv5-1.0/pcores/pcie_ram_v2_00_a/hdl/verilog''.
  * Next, visit the webpage at https://secure.xilinx.com/webreg/clickthrough.do?cid=106532 and download the ''dma_performance_demo_x1.zip'' file. This will require you to have a registered Xilinx login account.
  * Copy the following Verilog files from ''dma_performance_demo_x1/fpga/BMD/rtl'' to ''/platforms/edk/xupv5-1.0/pcores/pcie_ram/hdl/verilog'' on the Primary PC:

    BMD_32.v
    BMD_64.v
    BMD_EP.v
    BMD.v
    BMD_INTR_CTRL_DELAY.v
    BMD_32_RX_ENGINE.v
    BMD_64_TX_ENGINE.v
    pcie_endpoint_product.v
    BMD_CFG_CTRL.v
    BMD_32_TX_ENGINE.v
    BMD_RD_THROTTLE.v
    BMD_TO_CTRL.v
    BMD_EP_MEM.v
    BMD_INTR_CTRL.v
    BMD_64_RX_ENGINE.v
    BMD_EP_MEM_ACCESS.v
    pci_exp_64b_app.v

  * On the Primary PC, type:

    cd /platforms/edk/xupv5-1.0/pcores/pcie_ram/hdl/verilog
    patch -p1 -i bmd.patch

  * This command will patch the HDL files to fit our application requirements. You should expect to see the following output:

    patching file BMD_64_RX_ENGINE.v
    patching file BMD_64_TX_ENGINE.v
    patching file BMD_EP_MEM_ACCESS.v
    patching file BMD_EP_MEM.v
    patching file BMD_EP.v
    patching file BMD.v
    patching file pci_exp_1_lane_64b_ep.v
    patching file pci_exp_64b_app.v
    patching file xilinx_pci_exp_ep.v

\\

======Hardware setup======

====Setting DIP switches====

  * The XUPv5 FPGA board should have its DIP switches (on the front and back) configured as shown in the following pictures.

{{:documentation:dip1.jpg?250|Front DIP Switches}} {{:documentation:dip2.jpg?265|Rear DIP Switches}}

====2GB DDR2 upgrade====

  * By default, the XUPV5-LX110T board comes equipped with a 256MB memory SODIMM (on the backside). Unfortunately, due to our FPGA memory requirements, it is necessary to upgrade this part to a larger SODIMM. On the FPGA itself, the Microblaze soft core requires roughly 16MB of memory while the simulated target machine requires a minimum of 256MB. Therefore, you need at least 512MB of memory. In our release, we have only tested (and will only support) a 2GB DDR2 upgrade.
  * The picture below shows the part we have successfully tested on the XUPV5-LX110T. The DDR2 specs: MT16HTF25664HY-667E1, 2GB 2RX8 PC2-5300S-555-12, 667, CL5.
{{:documentation:ddr2_sodimm.jpg?300|2GB DDR2 Micron SODIMM}}

====Installing the XUPv5 board into the PCI express slot====

  * The XUPv5 board should be firmly inserted into the Primary PC's PCI Express x1 slot. Note: we have only tested on the GIGABYTE GA-G31M-ES2L motherboard. Because the XUPv5 board is large, you will need to remove the clamps normally used to secure the motherboard's DDR2 memory (a flathead screwdriver is needed here). The XUPv5 board may also be slightly flexed against the DDR2 DIMMs (see picture below).
  * The RS232 cable should be attached from the XUPv5 board to the back of the Secondary PC.
  * The USB JTAG programming cable should be attached on one end to the XUPv5 board, and the USB cable coming from the JTAG unit should be connected to the Secondary PC.
  * The XUPv5 should only be powered using the stand-alone AC adapter.

{{:documentation:pcie1.jpg?300|XUPv5-LX110T board slotted into the PCIe x1 slot}}

\\

======Preparing a Simics checkpoint======

The ProtoFlex simulator uses Simics checkpoints to initialize the machine state of a simulated target system (e.g., CPU registers, main memory) that is hosted on the FPGA. A Simics checkpoint is simply a snapshot of simulated machine state in the form of one or more CPUs' worth of registers, a physical main memory image, and device state. Checkpoints allow us to stage and position our workloads without having to reboot the target machine over and over. When running Simics, the simulation of a target machine can be interrupted at any moment in order to save a checkpoint. In this section, we will give a short tutorial on what is needed to set up and create your own Simics checkpoints.
Note: some of these instructions are borrowed directly from the {{http://si2.epfl.ch/~parsacom/projects/simflex/software/Flexus-Getting-Started-3.0.0.pdf|Flexus Getting Started Guide 3.0}} authored by Evangelos Vlachos as well as the {{:documentation:simics-target-guide-serengeti.pdf|Simics Serengeti Target Guide}}.

====Installing Solaris in a simulated machine====

  - The first step is to acquire the Solaris 10 CDROM ISO images, which are freely available for download from http://www.sun.com/software/solaris/get.jsp. The specific edition of Solaris 10 we have tested with is: **Solaris 10 8/07, labeled as sol-10-u4-ga-sparc**. Note: you MUST download the **CDROM** ISO images since the Simics scripts do not handle the DVD version. As of this writing, the 5 CDROM ISO image files that you should expect to have are: ''sol-10-u4-ga-sparc-{v1, v2, v3, v4, v5}.iso''.
  - The Simics package includes scripts to automate the installation of Solaris within a simulated target machine. These scripts can be found under the ''/simics-3.0.22/targets'' directory. The specific target system that we use for our configuration of ProtoFlex is the **serengeti** target. To keep things simple, copy all of the ISO images downloaded in the previous step into this folder.
  - Within the ''/simics-3.0.22/targets/serengeti'' folder, there are a large number of scripts that automate the Solaris installation process. To customize our target machine configuration, first open and edit the ''serengeti-6800-system.include'' file.
  - Near the top of the file, you will notice some high-level options for your simulated target machine. Specifically, we are interested in **the number of CPUs** as well as the **number of megabytes per CPU**. Solaris 10 requires at least 256MB of memory. With respect to the FPGA/board we are using, we are currently limited to only 4 CPUs and at most 1.9GB of simulated main memory.
  - **VERY IMPORTANT STEP (DO NOT SKIP!)**: At the top of ''serengeti-6800-system.include'', change ''$cpu_class = "ultrasparc-iii-plus"'' to ''$cpu_class = "ultrasparc-iii"''
  - To speed up the installation, set the number of CPUs to **1** and the amount of main memory per CPU to **512MB**. These parameters can be changed at a later time after the OS installation completes and the machine is rebooted.
  - Once you have completed this step, open and edit the ''abisko-sol10-cd-install1.simics'' file. Set the path to the first CD image by setting the line: ''$cdrom_path = "sol-10-u4-ga-sparc-v1.iso"''
  - Start the Simics installation by typing ''../../scripts/start-simics -x abisko-sol10-cd-install1.simics'' and wait for the entire process to complete. A terminal from the target machine should appear and show you the progress of the OS installation.
  - During the installation, you may be asked to answer a few questions manually (since the Simics scripts are slightly out-of-date). You will get one question about NFS (just hit ESC-2 twice) and another on setting the root password (put whatever you want). You will also be asked to enable/disable remote services (select 'no').
  - The entire installation may take several hours, depending on the performance of your host PC workstation.
  - When the script terminates, the installation from the first CD is finished, and Solaris will have tried to reboot the system. You will need to exit Simics at this point by hitting ''CTRL-C'' at the Simics console and typing ''quit''.
  - Edit the ''abisko-sol10-cd-install2.simics'' script and set the proper ''$cdrom_path'' as before. Now run the second script by typing: ''../../scripts/start-simics -x abisko-sol10-cd-install2.simics''. During the second script, you may be asked for additional input, such as the preferred keyboard type. At some point, you will be asked to select the media type. Choose 'CD/DVD'.
  - When the second script is finished, the Solaris installation will have tried to reboot the system. As before, hit ''CTRL-C'' and type ''quit'' at the Simics console.
  - Start the third script by typing ''../../scripts/start-simics -x abisko-sol10-cd-install3.simics''. This should only take a few minutes to complete. Afterwards, you will be presented with a login prompt. Type ''root'' and the password you specified earlier.
  - The machine will shut down momentarily and at this point, a large Simics disk image called **abisko-sol10-install.disk** and a state file called **abisko-sol10.state** will have been created. After the machine shuts down, type ''quit'' at the Simics console.

\\

**With a finalized disk image, we are now ready to boot the operating system and create our first Simics checkpoint.**

  - Open and edit the ''abisko-common.simics'' file and add the following lines near the top:

    $os = solaris10
    $num_cpus = 4
    $megs_per_cpu = 64

  - These parameters allow us to configure the target machine at boot time according to our preferences. The design we will be demonstrating is a 4-CPU system with a total of 256MB (4 CPUs x 64MB per CPU). These settings must match the capabilities of the FPGA platform that is used. In the case of XUPv5, the maximum number of CPUs we are able to support at the moment is 4, and the maximum amount of memory is 1.9GB (although Simics requires this to be a power of two, so 1GB is the true maximum). Note: it is recommended to go with the absolute minimum amount of memory needed for your application in order to reduce the amount of time it takes to stream the memory image over to the FPGA (at present, we can transfer at roughly 1-2MB/s over ethernet; future updates will include a faster PCI express interface).
  - Once you have edited the parameters, type ''../../scripts/start-simics -x abisko-common.simics'' to boot our machine.
  - A simulated terminal should appear and show the Solaris 10 boot process.
  - Once you reach the interactive terminal, we are ready to save our first checkpoint.
  - Hit ''CTRL-C'' in the Simics console, and type ''write-configuration ''. It is recommended that you create a new folder (e.g., /home/checkpoints) to store your checkpoints.
  - Type ''quit'' to exit Simics.
  - To load your checkpoint again, type ''../../scripts/start-simics''. Once you are at the Simics console, type ''read-configuration ''. You should see your simulated terminal reappear where you last left it.

**Note: as mentioned earlier, if you wish to change the number of CPUs and/or memory, you must edit the ''serengeti-6800-system.include'' file and follow the boot steps that were just described.** As stated earlier, we are currently limited to 4 CPUs and only up to 1.9GB of memory (the Simics scripts may force you to select a power of two for main memory, so up to 1GB only).

\\

======Preparing a test workload======

In this section, we will cover the basics necessary to prepare a simple multithreaded microbenchmark for execution within the target system. This process of moving the workload into the target machine and executing until a breakpoint is usually carried out entirely within a Simics-only environment. The microbenchmark that we provide is a simple pthreads example that can be downloaded from {{:documentation:microbenchmarks.tgz|}}. Within the tarball, there are two source files: ''counter.c'' and ''spinlock.c''. These two files have already been precompiled using a SPARC compiler and can be executed within the target machine. In the next step, we will go through the steps needed to move these files into the simulated target system.

First, you will need to acquire the {{:documentation:simicsfs.iso.zip|simicsfs.iso}} file, which contains a cdrom image of the Simics files to facilitate target-to-host file transfers.

  - Start up a checkpoint that was saved out from the previous section (e.g., start-simics ).
At the Simics console, type ''new-file-cdrom simicsfs.iso''
  - Then type ''cd0.insert iso0''
  - Type ''c'' to begin simulating at the console. You may need to wait a few minutes until the simulated cdrom drive has loaded the image.
  - Once you have done this, navigate to ''/cdrom/cdrom0'' within the target machine. You will see several files named ''mount_simicsfs'' and ''simicsfs-sol*''.
  - Type the following commands:

    bash
    mkdir -p /usr/lib/fs/simicsfs
    cp /cdrom/cdrom0/mount_simicsfs /usr/lib/fs/simicsfs/mount
    cp /cdrom/cdrom0/simicsfs-sol10 /usr/kernel/fs/sparcv9/simicsfs
    export TERM=vt100
    vi /etc/vfstab

  * Inside the vfstab file, add a new line at the very end (with each entry tab-delimited):

    simicsfs - /host simicsfs - no -

  * Type '':wq'' to save the file and exit.
  * Type ''mkdir /host''
  * This is usually a good time to save a checkpoint, right before you mount the host file system. At the Simics console, type ''CTRL-C'' followed by something like ''write-configuration /''
  * Type ''c'' at the Simics console to resume.
  * Within the simulated console, type ''mount /host''
  * Type ''ls /host'' to see the underlying host machine's root directory

At this point, you should place the microbenchmark files somewhere on the host machine and copy them over to the target machine. Save out a checkpoint again and quit Simics.

Next, we will create a Simics script that allows us to detect breakpoints inserted within our application in order to stage the workload. A breakpoint (also known as a 'magic breakpoint' in Virtutech parlance) is simply a predefined assembly instruction inlined into your code. This instruction usually has no effect (e.g., a write to register 0) but is recognized by Simics. You can take a look at all the magic breakpoint instructions within the ''magic-instruction.h'' file within the microbenchmarks tarball downloaded earlier.
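The idea behind staging with magic breakpoints can be sketched outside Simics: the argument carried by the magic instruction selects which staging point was reached. The snippet below is plain Python that does not call the Simics API; ''BREAK_STAGES'' and ''stage_for_magic'' are hypothetical names, and the argument values 1 and 2 simply mirror the ones used by the ''break.simics'' script in the steps that follow.

```python
# Hypothetical table: magic-instruction argument -> staging message.
BREAK_STAGES = {
    1: "Entered main()",
    2: "First thread spawned",
}


def stage_for_magic(arg):
    """Return the staging message for a magic argument.

    None means the argument is not one we stage on, so the simulation
    would simply keep running.
    """
    return BREAK_STAGES.get(arg)


if __name__ == "__main__":
    for arg in (1, 2, 3):
        print(arg, stage_for_magic(arg))
```

Inside Simics the same decision is made in the hap callback, which stops the simulation instead of returning a string.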
  - Create a new Simics script called break.simics and fill it in with this:

    @def hap_callback(user_arg, cpu, arg):
        if arg == 1:
            SIM_break_simulation("Entered main()")
        if arg == 2:
            SIM_break_simulation("First thread spawned")
    @SIM_hap_add_callback("Core_Magic_Instruction", hap_callback, None)
    read-configuration

  - Launch Simics by typing ''start-simics break.simics''
  - Within the simulated console, navigate to the directory where you copied the microbenchmark files.
  - Type: ''./spinlock 4 1000 10 10 0''
  - Simics should immediately break to the console and output ''Entered main()''
  - Typing ''c'' again will break once the first thread reaches the beginning of its handler
  - You can see how the source code inserts the magic instructions by looking at ''spinlock.c''
  - **Save out a final checkpoint**
  - **FINAL STEP**. This final step is needed to maximize the performance of the underlying simulated I/O system. Simics is typically the initiator of DMA transactions, which occur at some bulk-sized granularity. This granularity is set by default to a very low value (64 bytes) in default Simics checkpoints. Since Simics is a software-based simulator, issuing many small bulk transfers imposes no simulation overhead. In our system, large bulk transfers are far more desirable. To change this default setting, you will need to **EDIT** the checkpoint file and make one small change.
  - Type the following commands:

    cd
    perl -pi -e 's/dma_block_size: 64/dma_block_size: 8192/'

\\

======Validating a Workload for ProtoFlex======

Prior to loading any Simics checkpoint into the ProtoFlex simulator, it is necessary to verify that the checkpoint has no transient state that cannot be loaded into FPGA hardware. For example, Simics allows a checkpoint to be saved while a pending interrupt is queued up for a processor (or if a DMA transaction is waiting on the event queue).
To check for this, you should run this script prior to loading any Simics checkpoint:

    checkpfckpt

If there are no errors, the script will return with no messages. If there are reported problems, the solution is to load the checkpoint, advance its state by some amount of time, and save out a new checkpoint. This usually allows the transient operations (e.g., DMA, interrupts) to complete. In I/O-intensive applications, this may take several tries before you can get the system to be "quiet".

\\

======Generating the bitstream======

In this section, we will cover the basic steps needed to generate the bitstream file that will be used to program the XUPV5-LX110T FPGA. The top-level project we use is a modified version of an XUPv5-LX110T reference design (taken from http://www.xilinx.com/univ/xupv5-lx110t-refdes.htm) based on the Xilinx Embedded Development Kit 10.1 (EDK) tool chain. In our design, we have created our own ''pcore'' (in Xilinx parlance), which is an IP block that contains a multithreaded UltraSPARC III core called the **BlueSPARC**. BlueSPARC is written using a high-level, synthesizable hardware description language called Bluespec SystemVerilog (BSV). The BSV compiler takes our Bluespec description in the form of ''*.bsv'' files and generates purely synthesizable Verilog code. In our flow, once this Verilog code is generated, we synthesize it into an .NGC file using Xilinx XST 10.1. This .NGC file is then imported into a template ''pcore'', which is inserted into our EDK project. Once we have done this, we simply "press a button" and EDK will generate a bitstream for us that can be programmed onto the FPGA.

The process of generating the bitstream typically takes several hours. For demonstration purposes, you can skip this step by using our pre-generated bitstream files saved under .
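As a rough summary of the flow just described, the stages can be written down as an ordered list with the approximate durations quoted in the step-by-step instructions of this guide (measured on a Core 2 Duo E7500 @ 2.93GHz). This is only a planning sketch; ''BUILD_STAGES'' and ''total_minutes'' are hypothetical names, and actual times will vary with the host machine.

```python
# (stage, tool, approximate minutes) -- durations are the figures quoted
# in this guide, not measurements made by this script.
BUILD_STAGES = [
    ("Generate Verilog from *.bsv", "Bluespec compiler", 15),
    ("Synthesize Verilog into .NGC", "Xilinx XST 10.1", 45),
    ("Build EDK project into bitstream", "Xilinx EDK 10.1", 180),
]


def total_minutes(stages):
    """Sum the approximate duration of every stage in the flow."""
    return sum(minutes for _, _, minutes in stages)


if __name__ == "__main__":
    for name, tool, minutes in BUILD_STAGES:
        print(f"{name:36s} {tool:20s} ~{minutes} min")
    print("total: ~%.1f hours" % (total_minutes(BUILD_STAGES) / 60))
```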
====Generating and synthesizing RTL on the Primary PC====

  - To generate the UltraSPARC III core model (BlueSPARC) used in the simulator, navigate to the RTL directory at: ''/rtl/bluesparc''.
  - Typing **make xupv5_top** will invoke the Bluespec compiler and generate the output Verilog files under the ''/rtl/bluesparc/build'' directory. Generating the Verilog files on a Core 2 Duo E7500 @ 2.93GHz should take about 15 minutes.
  - Once the Verilog files are generated, the Makefile will automatically invoke the synthesizer (Xilinx XST). After about 45 minutes, the final netlist will be stored under the ''./xst_runs/mkBluesparc_64to32_'' sub-folder.
  - Afterwards, it is necessary to generate the EDK project that will be used to produce the bitstream for the XUPV5 FPGA. Navigate to the ''/platforms'' directory and type: ''make xupv5''. You will be asked to overwrite files (hit 'Y') and to enter a short description of the build (this is recommended to keep track of multiple builds, if necessary).
  - Once you hit enter, a new folder in the format of ''/platforms/build/xupv5--'' will automatically be generated.
  - The **FINAL** step is to copy the NGC file into the generated EDK project. Example: **''cp /rtl/xst_runs/mkBluesparc_64to32_09-18_1724/mkBluesparc_64to32.ngc /platforms/build/xupv5-001-Sep-18/pcores/bluesparc_v1_00_0/netlist/''**
  - Open and build the EDK project at the command line (example):

    cd /platforms/build/xupv5-001-Sep-18
    xps -nw xupv5.xmp
    % run init_bram

  * When this step is completed (about 3 hours), the final bitstream file will be located under ''/platforms/build/xupv5-001-Sep-18/implementation/download.bit''. There will also be an ELF executable file saved under ''/platforms/build/xupv5-001-Sep-18/pfserver.elf''. **Copy these two files to the Secondary PC**.

\\

======Preparing the PCI express driver on the Primary PC======

  * To compile the PCI express Linux drivers on the Primary PC, you must download the Linux kernel source.
This can be achieved by:
  - Type ''/sbin/yast2'' at the command-line (as root).
  - Navigate to Software Management.
  - Search for and install the package called "kernel-source". You must ensure that the kernel sources match your version of the kernel (you can check your kernel version by typing ''uname -a'' at the command-line). We have officially tested 2.6.27.29-0.1. If your kernel and the kernel sources are not the same, the easiest way to get them synced up is to install the kernel-base package (also within yast), which will update your Linux kernel to the latest version. **Make sure to reboot your system and to pick the new kernel at the GRUB menu after doing this**.
  * The next step is to build the Linux PCI express driver that allows the FPGA/XUPv5 to communicate with the Intel CPU host through main memory. Navigate to ''/drivers/xupv5_pcie/module'' and type ''make''. Make sure that there are no compiler errors.
  * We will cover a few additional steps needed for the PCI express driver in the next section.
\\
======Downloading the bitstream to FPGA======
In this section, we will start by programming the XUPv5-LX110T with our generated bitstream. Before continuing, we first describe conceptually how the FPGA component operates.

The FPGA bitstream implements a system-on-chip that contains the BlueSPARC core as well as a Microblaze used to facilitate communication with the Linux PC workstation. The Microblaze runs a bare-metal C application called ''pfserver'', which simply runs a ''while(1) read(..)'' loop that processes incoming messages from the PC workstation. At both ends, this communication is exposed as a sockets-like (''put'' and ''get'') abstraction over PCI express. On the PC side, a software program called ''pfmon'' is the top-level controller that issues commands and queries over PCI express to the ''pfserver'' program running on the Microblaze.
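The message loop described above can be pictured with the following minimal sketch. It is not the ''pfserver'' source: Python queues stand in for the PCI express ''put''/''get'' channel, and the message names (''push'', ''pop'', ''quit'') and handler functions are hypothetical, chosen only to show the shape of a read-and-dispatch loop.

```python
import queue


def serve(inbox, outbox, handlers):
    """Read-and-dispatch loop, standing in for pfserver's while(1) read(..) loop.

    The 'quit' message exists only so this demo terminates; it is an
    assumption of the sketch, not a documented pfserver command.
    """
    while True:
        command, payload = inbox.get()      # "get": blocking read of the next message
        if command == "quit":
            break
        handler = handlers.get(command)
        if handler is None:
            reply = ("error", f"unknown command: {command}")
        else:
            reply = handler(payload)
        outbox.put(reply)                   # "put": send the reply back to the controller


def demo():
    """Queue a few messages as the controller would, then run the loop once."""
    to_server, from_server = queue.Queue(), queue.Queue()
    registers = {}                          # toy stand-in for core state

    def push(payload):
        name, value = payload
        registers[name] = value
        return ("ok", name)

    def pop(name):
        return ("value", registers.get(name))

    handlers = {"push": push, "pop": pop}
    for message in [("push", ("pc", 0x1000)), ("pop", "pc"),
                    ("bogus", None), ("quit", None)]:
        to_server.put(message)
    serve(to_server, from_server, handlers)
    return [from_server.get() for _ in range(3)]


if __name__ == "__main__":
    for reply in demo():
        print(reply)
```

In this picture ''pfmon'' plays the role of the code that fills ''to_server'' and drains ''from_server''; each side sees only ''put'' and ''get'' calls.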
Apart from simply communicating with the PC workstation, the Microblaze plays an important role in communicating directly with the BlueSPARC core over a fast processor local bus (PLB). The Microblaze issues push/pop commands over the bus to the core in order to initialize or query its state. Although we will not discuss it in detail, the BlueSPARC core also occasionally issues requests to the Microblaze to software-simulate certain instructions that are not implemented in hardware.

In the following steps, we will first program the FPGA with our generated bitstream, and then download the ''pfserver'' application onto the Microblaze core running next to the BlueSPARC. Multiple tools can be used to configure the FPGA, such as Impact or Chipscope. In this tutorial we will use Impact to configure the FPGA and XMD to load the Microblaze executable into memory.

=== Using HyperTerminal on the Secondary PC to connect to the FPGA serial port ===
In order to get status messages from the FPGA, you need to connect the Secondary PC to the FPGA board through a serial link. For this you will need a female-to-female null-modem serial cable. First open HyperTerminal on the Secondary PC (assumed to be running Windows XP) by clicking ''Start-->All Programs-->Accessories-->Communications-->HyperTerminal''. Then type a name for the connection (e.g. ''XUPv5'') and hit OK. In the bottom drop-down menu, select the COM port where you attached the serial cable (usually COM1) and hit OK. Now select ''9600'' for the ''Bits per second'' option and ''None'' for the ''Flow control'' option and hit OK. You are now connected to the FPGA serial port (nothing should appear yet).

=== Programming the FPGA from the Secondary PC ===
  * At this point, you should have copied the ''download.bit'' and ''pfserver.elf'' files from the Primary PC to somewhere on the Secondary PC (we assume ''C:\'').
  * Power off the Primary PC. Power on the XUPv5 board using the external AC adapter.
Make sure the JTAG unit is connected to the XUPv5 and to the Secondary PC.
  * Open up Impact from the start menu. When asked, create a new project and click ''Finish'' to start the boundary scan. This will detect 5 components on the JTAG chain, with the last component being the FPGA. When prompted by an Open Window dialog, click ''Cancel all''.
  * Right-click on the last component (xc5vlx110t), and click ''Assign New Configuration File''. When prompted, select the ''download.bit'' file copied over from the Primary PC. Leave all default options and proceed with the programming. This should only take a few seconds.
  * Open a Cygwin command prompt by navigating to ''Start->Programs->Xilinx->EDK->Accessories->Launch EDK Shell''. Within the shell, navigate to the directory where you saved ''pfserver.elf''. If we assume that these files were on the ''C:\'' drive, type:
  cd /cygdrive/c
  xmd
  % connect mb mdm
  % dow pfserver.elf
  % con
  * At this point you should observe output in the HyperTerminal window indicating that the XUPv5 hardware is ready. In the XMD shell, the ''con'' command runs the Microblaze program (pfserver) and the ''stop'' command halts execution. For more information on other XMD commands, type ''help'' in the XMD shell.
  * We are now ready to boot the Primary PC and connect to the FPGA over PCI express.

=== Configuring the PCI express driver ===
  * Power on the Primary PC. **Don't forget to select the correct kernel at the GRUB menu (we use 2.6.27.29-0.1)**.
  * During bootup, your startup screen should show the FPGA board as a ''Memory Controller'':
{{:documentation:bios.png?350|Bootup Screen}}
  * After the Primary PC is booted, navigate to the ''/drivers/xupv5_pcie/module'' directory. Load the driver by typing ''sudo make load''.
  * This next step is slightly inconvenient as it may require some manual effort. Type ''cat /proc/devices''.
You should observe some output as shown below:
  Character devices:
    1 mem
    4 /dev/vc/0
    4 tty
    4 ttyS
    5 /dev/tty
    5 /dev/console
    5 /dev/ptmx
    7 vcs
   10 misc
   13 input
   14 sound
   21 sg
   29 fb
   99 ppdev
  116 alsa
  128 ptm
  136 pts
  180 usb
  189 usb_device
  216 rfcomm
  226 drm
  249 xupv5
  250 rtc
  251 hidraw
  252 usb_endpoint
  253 bsg
  254 perfmon
  * Look for the line that says ''### xupv5'' (where ''###'' is the device major number). Edit the ''Makefile'' and look for the command that says ''mknod /dev/xupv5 c 249 0''. If the ''249'' number does not match what you saw under ''/proc/devices'', you will need to change that line to the correct number.
  * If a change is needed, you should run the following:
  rm -rf /dev/xupv5
  mknod /dev/xupv5 c 0
  * Last but not least, when you have completed loading the driver, you should type ''dmesg''. Your output should resemble the following (the addresses may not match exactly):
  xupv5: module license 'unspecified' taints kernel.
  xupv5_module_init(395): Initialization vendor=8086 device=27d0
  xupv5_pcird 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
  xupv5_probe(178): BAR0 length: 1024
  xupv5_probe(180): BAR0 physical address: e1000000
  xupv5_probe(183): BAR0 virtual address: f91b0000
  xupv5_probe(255): Probe completed
  * This message will only show up once the PCI express driver detects the XUPv5 FPGA board.
\\
======Running the PFMON Tool======
  * In this section, we will cover the basic process of starting up the ProtoFlex simulator and running the test workload you created earlier.
  * A ProtoFlex simulation is initiated and controlled by a software tool called PFMON, which is a top-level controller that orchestrates interprocess communication between the Simics modules and the actual hardware running on the FPGA. PFMON is mainly operated through scripts and a command-line interface.
  * Before we start, it is necessary to create a memory cache directory (this is a cache of memory images dumped out by Simics for loading into the FPGA's memory).
This can be done by typing: ''mkdir -p ~/imgcache''.
  * At the command-line, start by typing: ''pfmon -job -defaulthw fpga_pcie''. The argument following ''-job'' should point to the checkpoint file that you created earlier while bootstrapping the ''spinlock'' microbenchmark. The ''defaulthw'' parameter selects the default hardware platform to use (other meaningful options include ''fpga'' (deprecated) and ''pli'' (for Verilog simulations)). **Please remember to run the "''checkpfckpt ''" command prior to this step as explained in the earlier section on "Creating a Test Workload"**.
  * You should be greeted with a startup message by PFMON and a command-line.
  * Type the following commands:
  connect -hw simcpu
  connect -hw simdev
  connect -hw default
  select -dev simdev0
  timer -hsrc simdev0
  reginit -hs simcpu0
  memdump -hw simcpu0 -path /home/imgcache
  memload -path /home/imgcache
  setcpu -en
  * The ''connect'' commands shown above initialize and establish a connection between the various platforms used throughout our system. Specifically, the ''connect -hw simcpu'' and ''connect -hw simdev'' commands instantiate the Simics processes in the background that are responsible for providing initial checkpoint state as well as facilitating simulated I/O devices.
  * The ''connect -hw default'' command is synonymous with ''connect -hw fpga_pcie'' (as was set in the initial launch command). This command establishes a connection between the Primary PC and the server code that runs on the Microblaze within the XUPv5 FPGA.
  * The ''select -dev'' command is simply for convenience and reduces the number of arguments that have to be passed to subsequent commands that involve devices.
  * The ''timer'' command programs the default hardware (in this case ''fpga_pcie'') with the expected rate at which the CPU/System timers (i.e., TICK, STICK) in the target system should advance.
  * The ''reginit'' command copies the full register file and TLBs of each simulated CPU from Simics to the FPGA.
  * The ''memdump'' command instructs Simics to generate a binary image of the target system's main memory as it was when the checkpoint was taken. The ''-path'' argument dumps the image to a cache directory (this avoids repeating this step each time the simulation is started up).
  * The ''memload'' command searches the cache directory and initializes the FPGA's memory system with the binary image. This process typically takes 60-70 seconds for a target system with 1GB of memory.
  * The ''setcpu'' command enables particular CPUs for running.
  * The ''stats -reset'' command is explained in the section on ''Statistics'' below.
  * **Note: all of these commands can be placed into a script file and passed into pfmon without re-typing them each time**. For example, if the above commands were pasted into a file named ''connect.scr'', then one could simply type: ''pfmon -job -defaulthw fpga_pcie -script connect.scr''. Commands can also be commented out by placing the ''#'' character before any given command.
  * A range of typical problems may occur during startup. If your Simics license was not configured properly, the ''connect'' commands may appear to hang indefinitely. If the FPGA is not properly configured, the ''connect -hw default'' command may issue warnings or errors. If you get into trouble, you should navigate to the log directory that gets created for each run instance. The directory is typically displayed (''Work directory'') once pfmon starts up and is usually of the form: ''/home//pflogs/-''. The ''cli.log'' file typically shows the commands you typed and their output. The ''simdev0.log'' and ''simcpu0.log'' files are usually the first places to look if you encounter any trouble. If for whatever reason pfmon crashes unexpectedly, you may need to issue a ''ps -aux'' command and look for any stray processes that need to be killed (the example below shows the various processes that get launched).
  4643 19.1 0.0 7412 1620 pts/4 R+ 18:35 12:14 pfmon -job /home/pf_user/checkpoints/spec2k-4cpu-1gb-ready -defaulthw fpga_pcie -script scripts/connect.scr
  4644 0.0 0.0 4196 1372 pts/4 S+ 18:35 0:00 /bin/sh /home/pf_user/protoflex/modules/simics_remote_ctrl/simics_listener/run_simics_cpus.sh
  4646 0.0 2.7 100096 85260 pts/4 Sl+ 18:35 0:01 /home/pf_user/simics-3.0.22/x86-linux/bin/simics-common -no-win -stall -x launch_cpus.simics
  4664 0.0 0.0 4196 1368 pts/4 S+ 18:35 0:00 /bin/sh /home/pf_user/protoflex/modules/simics_remote_ctrl/simics_listener/run_simics_devices.sh
  4666 6.3 2.8 101860 87088 pts/4 Sl+ 18:35 4:03 /home/pf_user/simics-3.0.22/x86-linux/bin/simics-common -stall -x launch_devices.simics
  * At this point, we should be ready to begin executing our first simulation. To begin, type: ''run -n 10000000 -q 1000000''. This command instructs the FPGA platform to execute 10 million instructions.
  * The ''-n'' argument specifies the total number of instructions to be executed across all enabled CPUs. The ''-q'' argument is also in units of instructions and indicates how frequently pfmon should halt the simulation on the FPGA and issue probes to the hardware. Having periodic "breaks" also allows us to halt the FPGA on demand using ''CTRL-C'' if necessary. For example, a typical way to execute 1 billion instructions would be: ''run -n 1000000000 -q 10000000''. A smaller ''-q'' value allows us to monitor the state of the simulation more frequently at the expense of performance overhead.
  * Once the simulation is running, you will notice a few statistics being updated in real time, for example:
  10850M/100000000M 813s avgmips:13.3 [probe:23457 mtp:5843799 ior:19647 iow:4277 irpt:499 dma-i:288kB dma-o:18423kB]
  * A complete run from beginning to end is shown below:
  pfmon v0.3 last rev: 7/1/09
  Type 'help'
  Work directory: /home/pf_user/pflogs//spec2k-4cpu-1gb-ready_263_183535
  pfmon> connect -hw simcpu
  Successful simics interface registration
  Waiting for connection to simics...
  Successful connection
  simcpu0 created
  pfmon> connect -hw simdev
  Successful simics interface registration
  Waiting for connection to simics devices...
  Successful connection
  simdev0 created
  pfmon> connect -hw default -ip 192.168.1.10
  Successful fpga interface registration
  Connecting over PCI express...
  Opening PCIE
  fpga_pcie0 created
  fpga_pcie0 set as default hw
  pfmon> select -dev simdev0
  selecting simdev0 as default device instance
  pfmon> timer -hsrc simdev0
  programming cpu timers (stick ratio: 6)
  pfmon> reginit -hs simcpu0
  Setting # of cpus for fpga_pcie0 to 4
  loaded from
  loaded from
  loaded from
  loaded from
  pfmon> memdump -hw simcpu0 -path /home/pf_user/imgcache
  /home/pf_user/imgcache/_home_pf_user_checkpoints_spec2k-4cpu-1gb-ready.img already exists.
  pfmon> setcpu -en
  enabling
  enabling
  enabling
  enabling
  pfmon> memload -path /home/pf_user/imgcache
  |==================================================| 100% of 1024MB loaded
  memory image from /home/pf_user/imgcache/_home_pf_user_checkpoints_spec2k-4cpu-1gb-ready.img loaded into fpga_pcie0 (72s)
  pfmon> stats -reset
  Statistics reset
  pfmon> step -n 100000000000000 -q 10000000
  fpga stepping 100000000000000 instructions
  10850M/100000000M 813s avgmips:13.3 [probe:23457 mtp:5843799 ior:19647 iow:4277 irpt:499 dma-i:288kB dma-o:18423kB]
\\
======Statistics======
  * At the PFMON command-line, you can view various runtime statistics by typing: ''stats''
  * To view stats that are specific to a single CPU, type ''stats -cpu'' followed by a CPU index, which ranges from 0 to N-1 for the N CPUs available in your Simics checkpoint.
  * To reset statistics, type ''stats -reset''. Note: resetting the performance counters in hardware is currently unsupported. The ''reset'' command is implemented in software by subtracting out an initial number of counts read out from the FPGA. It is important to remember this if you are planning to add your own instrumentation.
  * **Some of the statistics below show only zeros, e.g., total # branches**. This is because we did not enable branch profiling at compile time. To view these statistics, it is necessary to re-build the RTL with the desired options enabled. Please see the section further below on ''Compile-time Options''.
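The software-side reset described above can be sketched as follows. This is not the PFMON implementation; the class name, the ''read_hw'' callback, and the counter names are hypothetical. It only illustrates the idea of remembering a baseline at reset time and subtracting it from every later hardware read, which is the behavior to keep in mind when adding your own counters.

```python
class SoftResetCounters:
    """Emulate a counter reset in software when the hardware cannot be cleared.

    read_hw is a hypothetical callback returning the raw, ever-increasing
    hardware counters as a dict of name -> count.
    """

    def __init__(self, read_hw):
        self._read_hw = read_hw
        self._baseline = {}

    def reset(self):
        # Snapshot the current raw counts; the hardware itself is untouched.
        self._baseline = dict(self._read_hw())

    def read(self):
        # Report each counter relative to the last snapshot (0 if never reset).
        raw = self._read_hw()
        return {name: value - self._baseline.get(name, 0)
                for name, value in raw.items()}


if __name__ == "__main__":
    hardware = {"cycles": 1000, "instructions": 150}   # toy raw counters
    stats = SoftResetCounters(lambda: hardware)
    stats.reset()
    hardware["cycles"] += 500
    hardware["instructions"] += 70
    print(stats.read())
```

A counter that is added to the hardware but not routed through the same baseline logic would keep reporting its raw, un-reset value.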
  ========================== Aggregate BlueSPARC statistics ===========================
  Unless otherwise noted, % values in parenthesis indicate rate of the event per total # of instructions
  cycles: 10526913715 // total # of cycles (this is start & stopped during 'step' commands)
  stalls: 12586525 (0.120%) // total # cycles stalled due to resource hazard (does not include memory stalls)
  instructions: 1570000000 // total # instructions executed
  stalls per 100 inst: 0.8
  privileged insts: 1306905953 (83.242%) // total # privileged instructions executed
  cpu progress breakdown: // percentage of instructions executed by specific CPUs
  cpu 0 (30.0%)
  cpu 1 (33.5%)
  cpu 2 (9.4%)
  cpu 3 (27.1%)
  aggregate ipc: 0.149 // average IPC of the BlueSPARC pipeline
  micro-transplants: 1078068 (0.068667%) // # micro-transplants executed by the Microblaze
  pipeline retries: 9248885 (0.589%) // # aborted instructions (e.g., due to resource hazard)
  assist instructions: 21044559 (1.340%) // # micro-instructions used to facilitate complex instructions
  fetches: 1570000000 // # SPARC instructions fetched and executed
  fetch misses: 18456469 (1.176%) // # BlueSPARC I-cache misses
  stores: 49057631 (3.125%) // # store instructions
  store misses: 2202635 (0.140%) // # store misses
  loads: 165757375 (10.558%) // # load instructions
  load misses: 15656816 (0.997%) // # load misses
  interrupts recv'd: 1271 (0.000081%) // total # of interrupts
  device interrupts: 25 (0.000002%) // # device interrupts
  cpu cross-calls sent: 1246 (0.000079%) // # cpu-to-cpu interrupts
  cross-calls aborted: 208561 // # cpu-to-cpu interrupts that aborted due to busy CPU
  i/o reads: 147 (0.000009%) // # of memory-mapped I/O reads
  i/o writes: 159 (0.000010%) // # of memory-mapped I/O writes
  simics i/o cnt: 306 // total # I/Os
  simics i/o lat (us): 1544 // average latency of Simics I/O transplant (in microseconds)
  simics lat (us): 1108 // average latency (Simics-only overhead)
  flushes: 107682 (0.006859%) // total # of i- and d-cache flushes
  tick interrupts: 0 // # interrupts generated by TICK register
  stick interrupts: 1488 // # interrupts generated by STICK register
  illtraps: 0 (0.000000%) // # illegal traps (should be 0 otherwise something is wrong)
  fp_disabled: 0 (0.000000%) // # floating-point disabled traps
  fetch_align: 0 (0.000000%) // # misaligned fetches (should be 0)
  privileged_op: 0 (0.000000%) // # trapped non-privileged accesses
  total # branch: 0 (0.000000%) // # of branch instructions (requires OPT_BRANCH_STATS = True)
  # taken branch: 0 (0.000000%) // # taken branches (same as above)
  total # priv branch: 0 (0.000000%) // # of branches in privileged mode
  # taken priv branch: 0 (0.000000%) // # of taken branches in privileged mode
\\
======Compile-time Options======
  * There are a number of RTL compile-time options for the BlueSPARC core that can be used to include or omit instrumentation features. Low-level microarchitectural settings can also be changed (for developers only).
  * The main file that controls all the compile-time RTL configuration settings: **''/rtl/bluesparc/rtl/Configs/Defs.include''**
  * **Note: any changes to the configuration file will require a re-compile/re-synthesis of the BlueSPARC core**.
  * **Application options:**
  `define OPT_EVENT_COUNTS True // basic statistics
  `define OPT_ADV_EVENT_COUNTS True // more detailed statistics
  `define OPT_BRANCH_STATS False // enables branch profiling & statistics
  `define OPT_TIMING_PROFILE False // used to generate timing breakdowns
  `define OPT_CHIPSCOPE False // enable chipscope debugging wires
  `define OPT_SIU_COUNTS False // enable counting of special instructions
  `define OPT_TRACE_CMP False // enable TraceCMP modeling (BEE2 only)
  `define OPT_TRACE_BPRED False // enable TraceBranchPred modeling (BEE2 only)
  `define OPT_FLIGHT_DATA_RECORDER False // enable Flight Data Recorder (BEE2 only)
  `define OPT_MTP_BARRIER False // allow serializing of pipeline on micro-transplants
  `define OPT_IO_BARRIER False // allow serializing of pipeline on I/O transplants
  `define OPT_ASSERTIONS True // enable hardware assertions
  * **Changing CPU options should be left for advanced developers only (e.g., porting to a new platform or optimizing).**
  `define NUM_CONTEXTS 4 // Maximum # of physical CPU contexts (should be left at 4 for XUPv5-LX110T)
  `define OPT_LRAM_SIZE 16 // Optimal LUTRAM size (4-input LUT = 16, 6-input LUT = 64)
  // LUTRAMS should generally be used for small-CPU configurations (4 CPUs or below)
  // Setting to FALSE uses BRAMs, which are more efficient for large-CPU configurations (8 CPUs or above)
  `define ITLB_FA_USE_DISRAM True // Use LUTRAMs for 16-way fully-associative I-TLB
  `define DTLB_FA_TAG_USE_DISRAM True // Use LUTRAMs for 16-way fully-associative D-TLB tags
  `define DTLB_FA_DATA_USE_DISRAM True // Use LUTRAMs for 16-way fully-associative D-TLB data
  `define DTLB_FA_TAG_REPLICA_USE_DISRAM True // Use LUTRAMs for 16-way fully-associative D-TLB tag replica
  `define SIU_ALT_FILE_USE_DISRAM True // Use LUTRAMs for alternative register file in special instruction unit
  `define MMU_FILE_USE_DISRAM True // Use LUTRAMs for the MMU register file
  `define ITLB_W0_USE_DISRAM False // Use LUTRAMs for (# cpus) x 64-entry I-TLBs
  `define ITLB_W1_USE_DISRAM False // same as above (for the other 'way' of the 2-way set)
  `define DTLB_W0_USE_DISRAM False // Use LUTRAMs for (# cpus) x 256-entry D-TLBs
  `define DTLB_W1_USE_DISRAM False // same as above (for the other 'way' of the 2-way set)
  `define TRAP_STATE_USE_DISRAM True // Use LUTRAMs for trap state
  `define ICACHE_USE_QBRAM False // Use quad-pumped BRAMs for i-cache data array (experimental, untested)
  `define DCACHE_USE_QBRAM False // Use quad-pumped BRAMs for d-cache data array (experimental, untested)
  `define RF_USE_QBRAM False // Use quad-pumped BRAMs for register file (experimental, untested)
  `define CACHE_BLK_SIZE 512 // block size in bits (cannot be set smaller than 512 or else processor will have bugs)
  `define ICACHE_SIZE 32768 // I-cache size in bytes
  `define DCACHE_SIZE 32768 // D-cache size in bytes
\\
======Verilog Simulations======
  * For developers interested in simulating their own designs, we provide a baseline Verilog software platform module that substitutes for the FPGA. Invoking this platform is simply a 1-line change in the PFMON command-line invocation: ''pfmon -job -defaulthw **pli**''.
  * The Verilog platform module is generated using the Synopsys VCS Verilog Compiler installed on the Primary PC (tested using version Y-2006.06). We compile the source RTL files (generated by Bluespec) into a C executable file. To preserve the PFMON command-line abstraction and to simplify the process of driving the simulation with a Simics workload, the Verilog platform module wraps around the source RTL files and provides a FIFO-like interface via Named Pipes (http://linux.about.com/library/cmd/blcmdl4_fifo.htm) over PLI (http://www.asic-world.com/verilog/pli.html).
  * The top-level file that contains the wrapper and the instantiation of the BlueSPARC core is at ''/modules/pli_remote_ctrl/pli_listener/pli_listener.v''
  * Conceptually, the process of starting up the simulation and driving it with a workload is no different from what we showed you earlier for running a workload on the FPGA (except that it is several orders of magnitude slower). In the few steps below, we will illustrate how to prepare and run the Verilog simulation.
  * The relevant files are kept under: ''/modules/pli_remote_ctrl/pli_listener''
  * Before continuing, navigate to the ''/rtl/bluesparc'' directory and type **''make top_pli''**. Check that there are no compiler errors and that Verilog files have been generated under the ''/rtl/bluesparc/build'' directory.
  * Type ''cd /modules/pli_remote_ctrl/pli_listener'' then type ''make''. Verify that ''/home//vlog/pli_listener'' was generated.
  * To begin the simulation, navigate to ''/apps/pfmon''
  * Type ''pfmon -job -defaulthw pli''
  * Type ''step -n 1000 -q 100''
  * Type ''quit''
  * Navigate to the workspace directory under ''/home//pflogs/-''
  * A file named ''pli.log'' should have been generated. This file contains human-readable runtime traces generated via ''$display'' statements implemented within the Bluespec code. (More details on interpreting the traces will be added to this document in the future.)
  * Example output log:
  pfmon v0.3 last rev: 7/1/09
  Type 'help'
  Work directory: /home/pf_user/pflogs//db2-boot-tpcc-4cpus-128mb_263_215907
  pfmon> connect -hw simcpu
  Successful simics interface registration
  Waiting for connection to simics...
  Successful connection
  simcpu0 created
  pfmon> connect -hw simdev
  Successful simics interface registration
  Waiting for connection to simics devices...
  Successful connection
  simdev0 created
  pfmon> connect -hw default
  Successful pli interface registration
  Waiting for connection to pli...
  Successful connection
  Initializing memory interchip
  Programming forward progress trackers
  Programming retry delay to 0
  pli0 created
  pli0 set as default hw
  pfmon> select -dev simdev0
  selecting simdev0 as default device instance
  pfmon> timer -hsrc simdev0
  programming cpu timers (stick ratio: 6)
  pfmon> reginit -hs simcpu0
  Setting # of cpus for pli0 to 4
  loaded from
  loaded from
  loaded from
  loaded from
  pfmon> memdump -hw simcpu0 -path /home/pf_user/imgcache
  /home/pf_user/imgcache/_afs_scotch_project_workload_images_db2_v8_db2-boot-tpcc-4cpus-128mb.img already exists.
  pfmon> setcpu -en
  enabling
  enabling
  enabling
  enabling
  pfmon> memload -path /home/pf_user/imgcache
  memory image from /home/pf_user/imgcache/_afs_scotch_project_workload_images_db2_v8_db2-boot-tpcc-4cpus-128mb.img loaded into pli0 (0s)
  pfmon> step -n 1000 -q 100
  pli stepping 1000 instructions
  1000/1000 20s avgkips:0.1 [probe:20 mtp:0 ior:0 iow:0 irpt:1 dma-i:0kB dma-o:0kB]
  * To generate VCS dumps that can be processed by a waveform viewer (e.g., ModelSim), add these lines to the ''/modules/pli_remote_ctrl/pli_listener/pli_listener.v'' file:
  $dumpfile("waveforms.dump"); // save waveforms in this file
  $dumpvars (0, pli_listener); // saves all waveforms
  * The VCS dumps will be saved under ''/home//vlog/waveforms.dump''
  * Additional note: at the PFMON command-line in PLI mode, the ''stats'' command should also work as usual.
\\
======Known issues and limitations======
  * The current release design has an under-clocked DDR2 controller at 100MHz. We have internally developed a faster version, but this has not been released yet.
  * The ''svc.configd'' daemon that runs immediately after Solaris 10 boots up generates a high # of microtransplants in our system (floating-point instructions). This can create the appearance that the workload or target command-line is halted/frozen. One way to avoid this situation is to kill the daemon in the simulated system in a Simics-only simulation (find the pid by typing ''prstat'').
One can also simulate for a longer period of time after bootup (until the svc daemon finishes initialization).
  * In Solaris 10, you may receive a console warning that says: ''abisko genunix: WARNING: Time of Day clock error: [Changed in Clock Rate]. -- Stopped tracking Time of Day clock.'' This issue apparently also manifests if you run OpenSolaris under a VM (https://opensolaris.org/jive/message.jspa?messageID=26724). We are looking into this issue.
  * When typing in the target console at the command-line, it is necessary to type slowly and cautiously to avoid overflowing (and therefore crashing) the console (i.e., wait for the typed character to appear before continuing to type). This is a known issue with the Simics console that also happens in software-only instances of Simics.
  * The statistics shown in PFMON do not get reset in BlueSPARC between simulation runs. Only a fresh bitstream download (or system-wide reset) can do this.
  * Certain bulk memory instructions that we support currently require a minimum of 64B cache blocks. Configuring BlueSPARC with a smaller cache block size will result in a failing design, since some instructions will no longer be able to atomically access 64 bytes at a time (these are the Block LDD/STD instructions).
  * pfmon becomes unstable if the ''connect -hw fpga_pcie'' command is issued before ''connect -hw simcpu'' and ''connect -hw simdev''. We are looking into this issue.
  * Simulated ethernet devices have not been tested and are currently unsupported.
  * The synchronization counters in the PCI express code are only 32 bits wide, so running a simulation for several days could cause them to overflow. This is an easy fix, but we have not gotten around to it yet.
  * Saving out Simics checkpoints from ProtoFlex has not been ported from the BEE2 to XUPv5 yet.
  * The Microblaze code may emit warnings at times. These can be safely ignored unless you see an assertion; please report assertions to us.
  * The PCI express driver does not properly unload.
Reloading the driver will require you to reboot the Primary PC.
  * At present, there is a 4-CPU limitation (as explained in the Disclaimer section). We are also currently limited to simulating 1GB Simics workloads on the XUPv5.
  * There is no floating point unit (although FP instructions are supported 'slowly' via simulation on the Microblaze).
  * (internal) Flight data recorder in simulation has a format that's fixed for 16 CPUs (specifically, the FlightRec_t struct) --> needs to be fixed
  * (internal) BEE2 user_logic.v has CLK_2x tied to 0, so don't use double-clock BRAMs.
\\
======References======
**{{:documentation:mc09.pdf|ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs}}**\\
Eric S. Chung, Michael K. Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, and Ken Mai.\\
//ACM Transactions on Reconfigurable Technology and Systems//, 2009.\\

**{{http://www.ece.cmu.edu/~echung/memocode-camera.pdf|Implementing a High-performance Multithreaded Microprocessor: A Case Study in High-level Design and Validation}}**\\
Eric S. Chung and James C. Hoe.\\
//Formal Methods and Models for Codesign (MEMOCODE)//, July 2009.\\

**{{http://www.ece.cmu.edu/~echung/fpga08-chung.pdf|A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs}}**\\
Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, and Ken Mai.\\
//International Symposium on Field Programmable Gate Arrays//, February 2008, Monterey, CA.\\
\\
======Resources======
  * Support email: [[protoflex@ece.cmu.edu]]
  * Bluespec forum: http://www.bluespec.com/forum
  * XUPv5 reference pages: http://www.xilinx.com/univ/xupv5-lx110t.htm
  * {{:documentation:usiiiv2.pdf|UltraSPARC III Cu Reference Manual}}
\\