CryptoURANUS Economics: 10/14/20


Wednesday, October 14, 2020

FPGA XCV7-690T/HTG700 Hourly-Data-Updates


Xilinx ISE-LabTools Tutorial




The Xilinx Integrated Software Environment (ISE) is a powerful and complex set of tools. The purpose of this guide is to help new users get started using ISE to compile their designs. This guide provides a very high-level overview of how the tools work, and takes the reader through the process of compiling. The ultimate reference to ISE is of course the official documentation, which is installed on every PC in the lab, and is available from the Xilinx website. Because the documentation is so voluminous, this guide will attempt to provide some help with finding the right sections of the documentation to read. Don't miss the required reading section at the end of this guide which points out some sections of the documentation that every 6.111 student should read before beginning a complex labkit project.

From HDL to FPGA

The process of converting hardware description language (HDL) files into a configuration bitstream that can be used to program the FPGA is done in several steps.
First, the HDL files are synthesized. Synthesis is the process of converting behavioral HDL descriptions into a network of logic gates. The synthesis engine takes as input the HDL design files and a library of primitives. Primitives are not necessarily just simple logic gates like AND and OR gates and D-registers, but can also include more complicated things such as shift registers and arithmetic units. Primitives also include specialized circuits such as DLLs that cannot be inferred by behavioral HDL code and must be explicitly instantiated. The libraries guide in the Xilinx documentation provides a complete description of every primitive available in the Xilinx library. (Note that, while there are occasions when it is helpful or even necessary to explicitly instantiate primitives, it is much better design practice to write behavioral code whenever possible.)
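To make the distinction concrete, here is a small sketch (my own example, not from the 6.111 sources) of the two styles: an adder left for the synthesizer to infer from behavioral code, versus a Xilinx global clock buffer primitive that must be instantiated explicitly.

```verilog
// Behavioral style: the synthesizer infers an adder from the '+' operator.
module add8 (
    input  [7:0] a, b,
    output [8:0] sum
);
  assign sum = a + b;  // the tool chooses the adder implementation
endmodule

// Explicit instantiation, by contrast, is required for specialized
// primitives such as the Xilinx global clock buffer (BUFG), which
// cannot be inferred from behavioral code.
module clock_entry (
    input  clk_pad,
    output clk_global
);
  BUFG clkbuf (.I(clk_pad), .O(clk_global));
endmodule
```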
In 6.111, we will be using the Xilinx supplied synthesis engine known as XST. XST takes as input a verilog (.v) file and generates a .ngc file. A synthesis report file (.srp) is also generated, which describes the logic inferred for each part of the HDL file, and often includes helpful warning messages.
The .ngc file is then converted to an .ngd file. (This step mostly seems to be necessary to accommodate different design entry methods, such as third-party synthesis tools or direct schematic entry. Whatever the design entry method, the result is an .ngd file.)
The .ngd file is essentially a netlist of primitive gates, which could be implemented on any one of a number of types of FPGA devices Xilinx manufactures. The next step is to map the primitives onto the types of resources (logic cells, i/o cells, etc.) available in the specific FPGA being targeted. The output of the Xilinx map tool is an .ncd file.
The design is then placed and routed, meaning that the resources described in the .ncd file are then assigned specific locations on the FPGA, and the connections between the resources are mapped into the FPGA's interconnect network. The delays associated with interconnect on a large FPGA can be quite significant, so the place and route process has a large impact on the speed of the design. The place and route engine attempts to honor timing constraints that have been added to the design, but if the constraints are too tight, the engine will give up and generate an implementation that is functional, but not capable of operating as fast as desired. Be careful not to assume that, just because a design was successfully placed and routed, it will operate at the desired clock rate.
The output of the place and route engine is an updated .ncd file, which contains all the information necessary to implement the design on the chosen FPGA. All that remains is to translate the .ncd file into a configuration bitstream in the format recognized by the FPGA programming tools. Then the programmer is used to download the design into the FPGA, or write the appropriate files to a compact flash card, which is then used to configure the FPGA.


By itself, a Verilog model seldom captures all of the important attributes of a complete design. Details such as i/o pin mappings and timing constraints can't be expressed in Verilog, but are nonetheless important considerations when implementing the model on real hardware. The Xilinx tools allow these constraints to be defined in several places, the two most notable being a separate "user constraints file" (.ucf) and special comments within the Verilog model.
A .ucf file is simply a list of constraints, such as
net "ram0_data<35>" loc="ab25";
which indicates that bit 35 of the signal ram0_data (which should be a port in the top-level Verilog module) should be assigned to pin AB25 on the FPGA. Sometimes it is useful to combine several related constraints on one line, using "|" characters to separate constraints.
net "ram0_data<35>" loc="ab25" | fast | iostandard=lvdci_33 | drive=12;
The above example again assigns bit 35 of the signal ram0_data to pin AB25, and also specifies that the i/o driver should be configured for fast slew rate, 3.3V LVTTL level signaling (with a built-in series termination resistor), and a drive strength of 12mA. See the Xilinx documentation for more details, but don't worry: all of the pin constraints have been written for you (more on this later). Constraints can also be specified within special comments in the Verilog code. For example,
reg [7:0] state;
// synthesis attribute init of state is "03";
The Xilinx synthesis engine will identify the phrase "synthesis attribute" within any comments, and will add the constraint following this phrase to the list of constraints loaded from a .ucf file. In the above example, the initial state of the state signal after the FPGA finishes configuring itself will be set to 8'h03. In general, it is bad form to specify the initial state of signals using constraints, rather than implementing a reset signal. In some advanced designs, however, such initializations are sometimes necessary. The tools recognize a huge variety of constraints, and an entire section of the online manual is dedicated to explaining them. Fortunately, understanding a few simple constraints is sufficient for most designs.

ISE and the 6.111 Labkit

The FPGA used in the labkit has 684 i/o pins, and most of them are actually being used. To simplify the process of adding pin constraints to a new design, two template files have been developed. The file labkit.v is a template top-level Verilog module. This module defines names for all of the signals going in or out of the FPGA. Additionally, it provides default assignments for all FPGA outputs, so that unused circuits on the labkit are disabled. A template constraints file, labkit.ucf, ties the signals in labkit.v to the appropriate physical FPGA pins.

A Tutorial

Download the following source files:

Create a New Project

  1. Select "New Project" from the "File" menu
  2. Enter a project name and choose a project location. Be sure the top-level module type is set to HDL (hardware description language).
  3. On the next form, enter the target device information. The FPGA on the labkits is a Virtex 2 family XC2V6000, speed grade 4, in a BF957 package.
  4. The next form allows you to create a new source file. If you have already written the Verilog code for your project, click "Next" to advance to the next form, which allows you to add existing source files. You can also add existing files to your project later, after the project has been created.
  5. After creating or adding any source files, click "Next" until you reach the project summary form, and then click "Finish" to create the project.

Add Sources and Constraints

  1. If you did not add all of your Verilog source files when creating the project, add them now by choosing "Add Source..." from the "Project" menu.

Modify the Top-Level Template

  1. To instantiate the counter module, add the following code to labkit.v, just before the endmodule statement.
    counter counter1 (.clock(clock_27mhz), .led(led));
  2. The stock labkit.v file assigns a default value to the LED outputs. In order to drive the LEDs with the counter module, we need to delete the following line, which can be found near the end of labkit.v.
    assign led = 8'hFF; // Turn off all LEDs
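The counter module itself comes from the tutorial's source files, which are not reproduced here; a minimal version consistent with the instantiation above (my own sketch, not the official 6.111 file) could look like this:

```verilog
// Hypothetical counter module matching the port list used above.
// The top 8 bits of a free-running counter drive the LEDs, dividing
// the 27 MHz clock down to visibly slow blink rates.
module counter (
    input        clock,
    output [7:0] led
);
  reg [31:0] count = 0;

  always @(posedge clock)
    count <= count + 1;

  assign led = count[31:24];
endmodule
```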

Compile the Design

  1. The labkit.ucf file includes constraints for all of the FPGA outputs defined in labkit.v. Most designs, however, will not use all of these signals, in which case the synthesis engine will optimize the unused signals out of the design. It is therefore necessary to tell the place-and-route tool not to generate an error if it encounters a pin constraint for a (now) nonexistent signal. To do this, right-click on the "Implement Design" item in the process pane, and select "Properties...". Check the box to "allow unmatched LOC constraints".
  2. Make sure the top-level labkit module is selected in the sources window.
  3. Double click "Generate Programming File" in the tasks window. This will cause the "Synthesize", "Implement Design" and "Generate Programming File" tasks to be run in order. A green check mark will appear beside each task when it is successfully completed. It will take approximately 5 minutes for everything to complete. (The tools seem to hang for a few minutes while generating the pad report, but this is normal.)

Move the Design to a Compact Flash Card

  1. In the process pane, double-click on "Generate PROM, ACE or JTAG file". (You may have to expand the "Generate Programming File" line to see this item.)
  2. The iMPACT programming tool will launch, and a wizard will ask you several questions about what you want to do. You want to generate a SystemACE file.
  3. When asked to choose between the CF and MPM versions of SystemACE, choose CF (Compact Flash). The operating mode is unimportant.
  4. It is not necessary to specify the size of the CF card.
  5. Choose a name and a location for the SystemACE file collection you are about to generate.
  6. SystemACE allows up to eight designs to be stored on a single CF card. A switch on the labkit selects which design will be loaded into the FPGA. Check the boxes for as many designs as you want to include on the CF card.
  7. iMPACT will ask you to add design files for each configuration. You can potentially add more than one design file for each of the eight configurations (if there were more than one FPGA on the board, for example), but for the labkits, only add one design per configuration.
  8. Ignore the warnings about changing the configuration clock.
  9. Once all of the configurations have been specified, choose "Finish" and generate the .ace files. This takes a minute or two.
  10. Right click anywhere in the upper pane of the iMPACT window (with the system diagram) and select "Copy to CompactFlash...". Select the collection you just generated and copy it to the CF card.
  11. Insert the CF card in the labkit. Make sure the configuration source is set to "JTAG/CF", and select the configuration selector switch to the appropriate number. Power on the labkit, and your configuration should load.

10-FPGA Programming Methods

10 Ways To Program Your FPGA


6/10/2016 09:47 AM EDT
Despite the recent push toward high level synthesis (HLS), hardware description languages (HDLs) remain king in field programmable gate array (FPGA) development. Specifically, two FPGA design languages have been used by most developers: VHDL and Verilog. Both of these “standard” HDLs emerged in the 1980s, initially intended only to describe and simulate the behavior of the circuit, not implement it.

However, if you can describe and simulate, it’s not long before you want to turn those descriptions into physical gates.
For the last 20-plus years, most designs have been developed using one or the other of these languages, with some quite nasty and costly language wars fought along the way. Options other than these two languages exist for programming your FPGA. Let's take a look at what other tools we can use.
C / C++ / System C
The C, C++ or System C option allows us to leverage the capabilities of the largest devices while still achieving a semblance of a realistic development schedule... although that may just be my engineering management side coming out.

The ability to use C-based languages for FPGA design is brought about by HLS (high-level synthesis), which had been on the verge of a breakthrough for many years with tools like Handel-C. Recently it has become a reality, with both major vendors, Altera and Xilinx, offering HLS within their toolsets (Spectra-Q and Vivado HLx, respectively).
A number of other C-based implementations are available, such as OpenCL, which is designed for software engineers who want to achieve performance boosts by using an FPGA without a deep understanding of FPGA design, whereas HLS is still very much the domain of FPGA engineers who want to increase productivity.
HLS has its limitations too: just as with traditional HDL, you have to work with a synthesizable subset of the language. For instance, it is difficult to synthesize and implement system calls, and we have to make sure everything is bounded and of a fixed size.
What is nice about HLS, however, is the ability to develop your algorithms in floating point and let the HLS tool address the floating- to fixed-point conversion.
As with many things, we are still at the start of the journey: I am sure that over the coming years we will see HLS used increasingly with different languages, until HLS occupies a place for the FPGA engineer similar to that of very low-level C for a software engineer.

More info:

10 FPGA dev tools:

  • Page 1: C / C++ / System C
  • Page 2: MyHDL
  • Page 3: CHISEL
  • Page 4: JHDL
  • Page 5: BSV
  • Page 6: MATLAB
  • Page 7: LabVIEW FPGA
  • Page 8: SystemVerilog
  • Page 9: VHDL / VERILOG
  • Page 10: SPINAL HDL 


    fpga4fun.com - where FPGAs are fun

    FPGA software 1 - FPGA design software

    FPGA vendors provide design software that supports their devices. It does four main things:
    • Design-entry.
    • Simulation.
    • Synthesis / place-and-route.
    • Programming through special cables (JTAG).
    There are usually two versions: one free that supports low to medium density FPGA devices, and a full (non-free) version of the same software for big devices.
    The free software is usually fine to start with because it is similar in functionality to the full version, and today's low to medium density devices are very capable.
    Here's a summary of the features/limitations of the software:

    Xilinx's ISE (or the free ISE WebPACK) vs. Altera's Quartus (Pro, Standard, or the free Web/Lite edition):
    • Design entry: ISE: VHDL, Verilog, ABEL, Schematic, EDIF. Quartus: VHDL, Verilog, SystemVerilog, AHDL, Schematic, EDIF.
    • Core generator: ISE: yes (CORE Generator). Quartus: yes (MegaWizard Plug-Ins).
    • Functional simulation: ISE: no. Quartus: no (the last version with integrated simulation was 9.1SP2).
    • Testbench simulation: ISE: use ISim. Quartus: use the ModelSim-Altera Starter Edition.
    • Synthesis/P&R: the free versions of both are limited to small & medium devices.
    • FPGA editor: ISE: yes (FPGA Editor). Quartus: yes (Chip Editor).
    • Embedded logic analyzer: ISE: ChipScope PRO (a separate product - not free). Quartus: SignalTap II (included in the Quartus II Web/Lite edition).
    • Older versions: ISE: available from ISE Classics. Quartus: available from the Quartus II Software Archive.
    • OS support: both: Windows + Linux.
    • Price: both: free version $0; full version starting at $2995 for a 12-month license.
    Which is better?
    As of this writing (May 2013), Quartus-II is better overall - it runs faster, has a better GUI, better HDL support, and includes one killer feature: the SignalTap II embedded logic analyzer, which is easy to use and available in the free edition. Altera's low point is their simulator - they dropped their own integrated simulator but didn't have anything to replace it, so they rely on ModelSim for now.
    ISE is pretty good overall. Its low points are basic HDL support and ChipScope PRO (not part of the free suite).
    Xilinx has a new software suite called Vivado, but it is limited to high-end devices.
    Xilinx traditionally had better silicon, and Altera better software... this seems to still hold true.

    OpenCores: EDA Tools


    OpenCores is the world's largest community focusing on open-source development targeted for hardware. Designing IP cores is unfortunately not as simple as writing a C program. Many more steps are needed to verify the cores and to ensure they can be synthesized to different FPGA architectures and various standard-cell libraries.

    Open Source EDA tools

    There are plenty of good open-source EDA tools available. The use of such tools makes it easier to collaborate at the OpenCores site. An IP core that has readily available scripts for an open-source HDL simulator makes it easier for another person to verify and possibly update that particular core. A test environment built for a commercial simulator that only a limited number of people have access to makes verification more complicated.

    Icarus Verilog Simulator

    Icarus Verilog is a Verilog simulation and synthesis tool. It operates as a compiler, compiling source code written in Verilog (IEEE-1364) into some target format. For batch simulation, the compiler can generate an intermediate form called vvp assembly. This intermediate form is executed by the "vvp" command. For synthesis, the compiler generates netlists in the desired format.
    The compiler proper is intended to parse and elaborate design descriptions written to the IEEE standard IEEE Std 1364-2005.
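As a quick illustration (a generic example, not part of the Icarus distribution), a small self-checking design can be compiled and run in two commands: `iverilog -o sim adder_tb.v` followed by `vvp sim`.

```verilog
// adder_tb.v -- minimal self-checking testbench for Icarus Verilog.
module adder_tb;
  reg  [3:0] a, b;
  wire [4:0] sum = a + b;

  initial begin
    a = 4'd3; b = 4'd4;
    #1;  // let the continuous assignment settle
    if (sum !== 5'd7)
      $display("FAIL: sum = %d", sum);
    else
      $display("PASS");
    $finish;
  end
endmodule
```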
    Icarus web site


    Verilator simulator

    Verilator is a free Verilog HDL simulator. It compiles synthesizable Verilog into an executable format and wraps it into a SystemC model. Internally, a two-stage model is used. The resulting model executes about 10 times faster than standalone SystemC.
    Verilator has been used to simulate many very large multi-million-gate designs with thousands of modules. Therefore, we have chosen this tool to be used in the verification environment for the OpenRISC processor.
    Verilator web site

    GHDL VHDL simulator

    GHDL implements the VHDL87 (common name for IEEE 1076-1987) standard, the VHDL93 standard (aka IEEE 1076-1993) and the protected types of VHDL00 (aka IEEE 1076a or IEEE 1076-2000). The VHDL version can be selected with a command line option.
    GHDL web site

    EMACS - text editor

    GNU Emacs is an extensible, customizable text editor—and more.
    Very good support for both Verilog HDL and VHDL editing.
    Emacs web site

    Fizzim is a FREE, open-source GUI-based FSM design tool

    The GUI is written in Java for portability. The backend code generation is written in Perl for portability and ease of modification.



    • Runs on Windows, Linux, Apple - anything with Java.
    • Familiar Windows look-and-feel.
    • Visibility (on/off/only-non-default) and color control on data and comment fields.
    • Multiple pages for complex state machines.
    • "Output to clipboard" makes it easy to pull the state diagram into your documentation.


    • Verilog code generation based on recommendations from experts in the field.
    • Output code has "hand-coded" look-and-feel (no tasks, functions, etc).
    • Switch between highly encoded or onehot output without changing the source.
    • Registered outputs can be specified to be included as state bits, or pulled out as independent flops.
    • Mealy and Moore outputs available.
    • Transition priority available.
    • Automatic grey coding available.
    • Code and/or comments can be inserted at strategic places in the output - no need to "perl" the output to add your copyright or `include
    Fizzim web site
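For reference, the "hand-coded" output style described above is the classic two-always-block FSM idiom. Here is a sketch of that style (my own example, not actual Fizzim output):

```verilog
// Moore FSM in the hand-coded style that FSM generators aim to
// reproduce: parameter-encoded states, a registered state variable,
// and purely combinational next-state logic.
module blink_fsm (
    input  clk, rst, go,
    output led
);
  localparam [1:0] IDLE = 2'd0, ON = 2'd1, OFF = 2'd2;  // "highly encoded"
  reg [1:0] state, next;

  always @(posedge clk or posedge rst)
    if (rst) state <= IDLE;
    else     state <= next;

  always @(*) begin
    next = state;
    case (state)
      IDLE:    if (go) next = ON;
      ON:      next = OFF;
      OFF:     next = IDLE;
      default: next = IDLE;
    endcase
  end

  assign led = (state == ON);  // Moore output: a function of state only
endmodule
```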


    TCE toolset

    TCE is a toolset for designing application-specific processors (ASPs) based on the transport triggered architecture (TTA). The toolset provides a complete co-design flow from C programs down to synthesizable VHDL and parallel program binaries. Processor customization points include the register files, function units, supported operations, and the interconnection network.
    TCE has been developed internally at the Tampere University of Technology since early 2003. The current source code base consists of roughly 400,000 lines of C++ code.
    TCE web site

    C to Verilog translation

    An online C-to-Verilog compiler is available. The code generated by the site is licensed under BSD (use it "as is").
    C-to-Verilog web site

    Fedora Electronic Lab

    Fedora Electronic Lab tries to provide a complete hardware design flow with the best open-source tools. We try to ensure interoperability as far as we can, and we work with other open-source developers to improve existing EDA tools.
    Fedora Electronic Lab web site

    The FreeHDL Project

    Linux - The logical choice for EDA

    Subproject Teams:

    AIRE Implementation
    Frontend - parser/analyzer/codegen
    Waveform Viewer


    Related Projects...

    Commercial EDA software for Linux

    Subscribe to mailing list

    Mailing list archives


    A project to develop a free, open source, GPL'ed VHDL simulator for Linux!

    Project goals:
    To develop a VHDL simulator that:
    • Has a graphical waveform viewer.
    • Has a source level debugger.
    • Is VHDL-93 compliant.
    • Is of commercial quality. (on par with, say, V-System - it'll take us a while to get there, but that should be our aim)
    • Is freely distributable - both source and binaries - like Linux itself. (Under the Gnu General Public License (GPL)).
    • Works with Linux. If others want to port it to other platforms they may, but it is not the goal of this project.
    FreeHDL is used by Qucs for digital simulation. Qucs is a circuit simulator with graphical user interface. Qucs aims to support all kinds of circuit simulation types, e.g. DC, AC, S-parameter, Transient, Noise and Harmonic Balance analysis. It is available from
    Release 0.0.7 of the FreeHDL compiler/simulator system can be downloaded from here.
    Release 0.0.6 of the FreeHDL compiler/simulator system can be downloaded from here.
    Release 0.0.5 of the FreeHDL compiler/simulator system can be downloaded from here.

    XCV7T2000T/X690T/HTG700 Monero Bitstreams

    Xilinx Virtex-7 V2000T-X690T-HTG700


    Product Description:

    Powered by the Xilinx Virtex-7 V2000T, V585, or X690T, the HTG700 is ideal for ASIC/SOC prototyping, high-performance computing, high-end image processing, PCI Express Gen 2 & 3 development, general-purpose FPGA development, and/or applications requiring high-speed serial transceivers (up to 12.5Gbps).

     Key Features and Benefits:

    • Scalable via HTG-FMC-FPGA module (with one X980T FPGA) for higher FPGA gate density
    • x8 PCI Express Gen2 /Gen 3 edge connectors
    • x3 FPGA Mezzanine Connectors (FMC)
    • x4 SMA ports (16 SMAs providing 4 Txn/Txp/Rxn/Rxp) clocked by external pulse generators
    • DDR3 SODIMM with support for up to 8GB (shipped with a 1GB module)
    • USB to UART bridge
    • Configuration through JTAG or Micron G18 Flash


     What's Included:

    • Reference Designs
    • Schematic, User Manual, UCF
    • The HTG700 Board

    HTG-700: Xilinx Virtex™ -7 PCI Express  Development Platform


    Three High Pin Count (HPC) FMC connectors provide access to 480 single-ended I/Os and 24 high-speed Serial Transceivers of the on board Virtex 7 FPGA. Availability of over 100 different off-the-shelf FMC modules extend functionality of the board for variety of different applications. 

    Eight lane of PCI Express Gen 2 is supported by hard coded controllers inside the Virtex 7 FPGA. The board's layout, performance of the Virtex 7 FPGA fabric, high speed serial transceivers (used for PHY interface), flexible on-board clock/jitter attenuator, along with soft PCI Express Gen 3 IP core allow usage of the board for PCI Express Gen3 applications. 

    The HTG-700 Virtex 7 FPGA board can be used either in PCI Express mode (plugged into host PC/Server) or stand alone mode (powered by external ATX or wall power supply).

    • Xilinx Virtex-7 V2000T, 585T, or X690T FPGA
    • Scalable via HTG-777 FPGA module for providing higher FPGA gate density
    • x8 PCI Express Gen2/Gen3 edge connectors with jitter cleaner chip
      - Gen 3: with the -690 option
      - Gen 2: with the -585 or -2000 option (Gen3 requires a soft IP core)
    • x3 FPGA Mezzanine Connectors (FMC)
      - FMC #1: 80 LVDS (160 single-ended) I/Os and 8 GTX (12.5 Gbps) serial transceivers
      - FMC #2: 80 LVDS (160 single-ended) I/Os and 8 GTX (12.5 Gbps) serial transceivers
      - FMC #3: 80 LVDS (160 single-ended) I/Os and 8 GTX (12.5 Gbps) serial transceivers. The physical location of this connector gives plug-in FMC daughter cards easy access through the front panel.
    • x4 SMA ports (16 SMAs providing 4 Txn/Txp/Rxn/Rxp) clocked by external pulse generators
    • DDR3 SODIMM with support for up to 8GB (shipped with a 2GB module)
    • Programmable oscillators (Silicon Labs Si570) for different interfaces
    • Configuration through JTAG or Micron G18 embedded flash
    • USB to UART bridge
    • ATX and DC power supplies for PCI Express and stand-alone operation
    • LEDs & pushbuttons
    • Size: 9.5" x 4.25"
    Kit Content:
    • HTG-700 board
    • PCI Express drivers (evaluation) for Windows & Linux
    • Reference designs/demos:
      - PCI Express Gen3 PIO
      - 10G & 40G Ethernet (available only if interested in licensing the IP cores)
      - DDR3 memory controller
    • User manual
    • Schematics (in searchable .pdf format)
    • User Constraint File (UCF)

    Ordering Information (Part Numbers):
    • HTG-V7-PCIE-2000-2 (populated with V2000T-2 FPGA)
    • HTG-V7-PCIE-690-2 (populated with X690T-2 FPGA)
    • HTG-V7-PCIE-690-3 (populated with X690T-3 FPGA)
    • HTG-V7-PCIE-585-2 (populated with V585T-2 FPGA)
    Contact us for availability and pricing.

    FPGA-(CVP-13, XUPVV4): Hardware Modifications

    Currently (as of 8 October 2018), the Bittware cards (CVP-13, XUPVV4) do not require any modifications and will run at full speed out-of-the-box.

    If you have a VCU1525 or BCU1525, you should acquire a DC1613A USB dongle to change the core voltage.

    This dongle requires modifications to ‘fit’ into the connector on the VCU1525 or BCU1525.

    You can make the modifications yourself as described here.

    Alternatively, you can purchase a fully modified DC1613A from

    If you have an Avnet AES-KU040 and you are brave enough to make the complex modifications to run at full hash rate, you can download the modification guide right here (it will be online in a few days).

    You can see a video of the modded card on YouTube here.

    If you have a VCU1525 or BCU1525, we recommend using the TUL Water Block (this water block was designed by TUL, the company that designed the VCU/BCU cards).

    The water block can be purchased from

    WARNING:  Installation of the water block requires a full disassembly of the FPGA card which may void your warranty.

    Maximum hash rate (even beyond water-cooling) is achieved by immersion cooling, immersing the card in a non-conductive fluid.

    Engineering Fluids makes BC-888 and EC-100 fluids which are non-boiling and easy to use at home. You can buy them here.

    If you have a stock VCU1525, there is a danger of the power regulators failing from overheating, even if the FPGA is very cool.

     We recommend a simple modification to cool the power regulators by more than 10C.

    The modification is very simple. You need:
    First, cut a piece of thermal tape and apply it to the back side of the Slim X3 CPU cooler, and plug the fan into the fan controller:

    Then, you are going to stick the CPU cooler on the back plate of the VCU1525 on this area:

    Once done it will look like this:

    Make sure to connect the fan controller to the power supply and run the fan on maximum speed.

    This modification will cool the regulators on the back side of the VCU1525, dropping their temperature by more than 10C and extending the life of your hardware.

    This modification is not needed on ‘newer’ versions of the hardware such as the XBB1525 or BCU1525.


    Grab Bag of FPGA and GPU Software Tools from Intel, Xilinx & NVIDIA

    FPGA's as Accelerators:
    • From the Intel® FPGA SDK for OpenCL™ Product Brief available at link.
    • "The FPGA is designed to create custom hardware with each instruction being accelerated providing more efficiency use of the hardware than that of the CPU or GPU architecture would allow." 

    • With Intel, developers can utilize an x86 with a built-in FPGA or connect a card with an Intel or Xilinx FPGA to an x86. This Host + FPGA Acceleration would typically be used in a "server."
    • With Intel and Xilinx, developers can also get a chip with an ARM core + FPGA. This FPGA + ARM SoC Acceleration is typically used in embedded systems.
    • Developers can also connect a GPU card from Nvidia to an x86 host. Developers can also get an integrated GPU from Intel. Nvidia also provides chips with ARM cores + GPUs.

    • Intel and Xilinx provide tools to help developers accelerate x86 code execution using an FPGA attached to the x86. They also provide tools to accelerate ARM code execution using an FPGA attached to the ARM.
    • Intel, Xilinx and Nvidia all provide OpenCL libraries to access their hardware. These libraries cannot interoperate with one another. Intel also provides libraries to support OpenMP, and Nvidia provides CUDA for programming their GPUs. Xilinx includes their OpenCL library in an SDK called SDAccel and an SDK called SDSoC. SDAccel is used for x86 + Xilinx FPGA systems, i.e. servers. SDSoC is used for Xilinx chips with ARM + FPGAs, i.e. embedded systems.

    • To help developers building computer vision applications, Xilinx provides OpenVX, Caffe, OpenCV and various DNN and CNN libraries in an SDK called reVISION for software running on chips with an ARM+FPGA.
    • All of these libraries and many more are available for x86 systems.
    • Xilinx also provides neural network inference, HEVC decoders and encoders and SQL data-mover, function accelerator libraries.

    Tools for FPGA + ARM SoC Acceleration:

    Intel:
    • Developers can work with ARM SoCs from Intel using:
    • ARM DS-5 for debug
    • SoC FPGA Embedded Development Suite for embedded software development tools
    • Intel® Quartus® Prime Software for working with the programmable logic
    • Virtual Platform for simulating the ARM
    • SoC Linux for running Linux on the FPGA + ARM SoC
    • Higher Level
    • Intel® FPGA SDK for OpenCL™ is available for programming the ARM + FPGA chips using OpenCL.

    • Developers can work with ARM SoCs from Xilinx using:
    • An SDK for application development and debug
    • PetaLinux Tools for Linux development and ARM simulation and
    • Vivado for working with the programmable logic (PL) in its FPGA + ARM SoC chips

    Higher Level:
    • Xilinx provides SDSoC for accelerating ARM applications on the built-in FPGA. Users can program in C and/or C++ and SDSoC will automatically partition the algorithm between the ARM core and the FPGA. Developers can also program using OpenCL and SDSoC will link in an embedded OpenCL library and build the resulting ARM+FPGA system. SDSoC also supports debugging and profiling.

    Domain Specific:
    Xilinx leverages SDSoC to create an embedded vision stack called reVISION.




    Joerg Schulenburg, Uni-Magdeburg, 2008-2016

    What is SpinPack?

    SPINPACK is a large program package that computes the lowest eigenvalues and eigenstates, and various expectation values (spin correlations etc.), for quantum spin systems.

    These model systems can for example describe magnetic properties of insulators at very low temperatures (T=0) where the magnetic moments of the particles form entangled quantum states.

    The package generates the symmetrized configuration vector and the sparse matrix representing the quantum interactions, then computes its eigenvectors and finally some expectation values for the system.

    The first SPINPACK version was based on Nishimori's TITPACK (Lanczos method, no symmetries), but it was soon converted to C/C++ and completely rewritten (1994/1995).

    Other diagonalization algorithms are implemented too (Lanczos, 2x2 diagonalization, and LAPACK/BLAS for smaller systems). It can handle Heisenberg, t-J, and Hubbard systems up to 64 sites by default, up to about 128 sites using special compiler and CPU features, or even more sites in a slower emulation mode (C++ required).

    For instance, we obtained the lowest eigenstates for the Heisenberg Hamiltonian on a 40-site square lattice on our machines in 2002. Note that the resources needed for the computation grow exponentially with the system size.

    The package is written mainly in C so that it runs on all Unix systems. C++ is only needed for complex eigenvectors and twisted boundary conditions when C has no complex extension. This way the package is very portable.

    Parallelization can be done using the MPI and PTHREAD libraries. Mixed (hybrid) mode is possible, but not always faster than pure MPI (2015). v2.60 shows a slight hybrid-mode advantage on CPUs supporting hyper-threading.

    This will hopefully be improved further. MPI scaling is tested to work up to 6000 cores, and PTHREAD scaling up to 510 cores, but the latter requires careful tuning (scaling measured 2008-2016).

    The program can use all topological symmetries, S(z) symmetry, and spin inversion to reduce the matrix size. This reduces the needed computing resources by a linear factor.

    Since 2015/2016, CPU vector extensions (SIMD: SSE2, AVX2) are supported to get better performance when doing symmetry operations on bit representations of the quantum spins.

    The results are very reliable because the package has been used in scientific work since 1995. A low-latency, high-bandwidth network and low-latency memory are needed to get the best performance on large-scale clusters.


    • Groundstate of the S=1/2 Heisenberg AFM on a N=42 kagome biggest sub-matrix computed (Sz=1 k=Pi/7 size=36.7e9, nnz=41.59, v2.56 cplx8, using partly non-blocking hybrid code on supermuc.phase1 10400cores(650 nodes, 2 tasks/node, 8cores/task, 2hyperthreads/core, 4h), matrix_storage=0.964e6nz/s/core SpMV=6.58e6nz/s/core Feb2017)
    • Groundstate of the S=1/2 Heisenberg AFM on a N=42 linear chain computed (E0/Nw=-0.22180752, Hsize = 3.2e9, v2.38, Jan2009) using 900 Nodes of a SiCortex SC5832 700MHz 4GB RAM/Node (320min).
      Update: N=41 Hsize = 6.6e9, E0/Nw=-0.22107343 16*(16cores+256GB+IB)*32h matrix stored, v2.41 Oct2011).
    • Groundstate of the S=1/2 Heisenberg AFM on a N=42 square lattice computed (E0 = -28.43433834, Hsize = 1602437797, ((7,3),(0,6)), v2.34, Apr2008) using 23 Nodes a 2*DualOpteron-2.2GHz 4GB RAM via 1Gb-eth (92Cores usage=80%, ca.60GB RAM, 80MB/s BW, 250h/100It).
    • Program is ready for cluster (MPI and Pthread can be used at the same time, see the performance graphic) and can again use memory as storage media for performance measurement (Dec07).
    • Groundstate of the S=1/2 Heisenberg AFM on a N=40 square lattice computed (E0 = -27.09485025, Hsize = 430909650, v1.9.3, Jan2002).
    • Groundstate of the S=1/2 J1-J2-Heisenberg AFM on a N=40 square lattice J2=0.5, zero-momentum space: E0= -19.96304839, Hsize = 430909650 (15GB memory, 185GB disk, v2.23, 60 iterations, 210h, Altix-330 IA64-1.5GHz, 2 CPUs, GCC-3.3, Jan06)
    • Groundstate of the S=1/2 Heisenberg AFM on a N=39 triangular lattice computed (E0 = -21.7060606, Hsize = 589088346, v2.19, Jan2004).
    • Largest complex Matrix: Hsize=1.2e9 (26GB memory, 288GB disk, v2.19 Jul2003), 90 iterations: 374h alpha-1GHz (with limited disk data rate, 4 CPUs, til4_36)
    • Largest real Matrix: Hsize=1.3e9 (18GB memory, 259GB disk, v2.21 Apr2004), 90 iterations: real=40h cpu=127h sys=9% alpha-1.15GHz (8 CPUs, til9_42z7) 


      Verify the download using: gpg --verify spinpack-2.55.tgz.asc spinpack-2.55.tgz



      • gunzip -c spinpack-xxx.tgz | tar -xf - # xxx is the version number
      • cd spinpack; ./configure --mpt
      • make test # to test the package and create exe path
      • # edit src/config.h exe/daten.def for your needs (see models/*.c)
      • make
      • cd exe; ./spin



      The documentation is available in the doc path. Most parts of the documentation have now been rewritten in English.

      If you still find parts written in German, or out-of-date documentation, send me an email with a short hint about where to find that part and I will rewrite it as soon as I can.

      Please see doc/history.html for the latest changes. You can find documentation about speed in the package, or an older version on this spinpack-speed-page.

      Most Important Function:

      The most time-consuming function is b_smallest in hilbert.c. This function computes the representative of a set of symmetry-equivalent spin configurations (bit patterns) from one member of that set.

      It also returns a phase factor and the orbit length. It would be great progress if the performance of that function could be improved. Ideas are welcome.

      One of my motivations to use FPGAs in 2009 was inspired by the FPGA/VHDL compiler.

      These are Xilinx tools, and they were so slow, badly scaling, and buggy that code generation and debugging was really no fun; a much better FPGA toolchain was needed for HPC. All of that has since been fixed by updates.

      2015-05: I added a software Benes network to benefit from AVX2, but it looks like this is still not the maximum available speed (HT shows nearly a factor of 2; does the bitmask fall out of the L1 cache?).

      Examples for open access

      Please use these data for your work, or verify my data. Questions and corrections are welcome. If you miss data or explanations here, please send me a note.


      Frequently asked questions (FAQ):

       Q: I try to diagonalize a 4-spin system, but I do not get the full spectrum. Why?
        A: Spinpack is designed to handle big systems. Therefore it uses as many
           symmetries as it can. The very small 4-spin system has a very special
           symmetry which makes it equivalent to a 2-spin system built from two s=1 spins.
           Spinpack uses this symmetry automatically to give you the possibility
           to emulate s=1 (or s=3/2, etc.) spin systems by pairs of s=1/2 spins.
           If you want to switch this off, edit src/config.h and change
      This picture shows a small sample of a possible Hilbert matrix. The non-zero elements are shown as black pixels (v2.33 Feb2008 kago36z14j2).

      Hilbert matrix N=36 s=1/2 kago lattice

      This picture shows a small sample of a possible Hilbert matrix. The non-zero elements are shown as black (J1) and gray (J2) pixels (v2.42 Nov2011 j1j2-chain N=18 Sz=0 k=0). The configuration space is sorted by J1-Ising-model energy to show the structure of the matrix. Ising energy ranges are shown as slightly grayed areas.

      Hilbert matrix for N=18 s=1/2 quantum chain

      Ground-state energy scaling for finite-size spin-1/2 AFM chains, N=4..40, using up to 300GB of memory to store the N=39 sparse matrix and 245 CPU-hours (2011, src=lc.gpl).
      ground state s=1/2-AFM-LC


      XC7V2000T-x690T_HTG-700: Models

      FPGA Schematic Updates: Cryptocurrency

      Reference Materials

      FPGA Schematic Updates

      FPGA Device Driver Memo



      FPGA device driver (Memory Mapped Kernel)


      A simple Linux device driver for FPGA access. This driver provides memory-mapped support and can communicate with FPGA designs. The advantage of memory mapping is that the system-call overhead is eliminated entirely; however, network overhead and the EPB bus bandwidth limitation remain. The PowerPC in ROACH is mainly intended for control and monitoring. For larger transactions and better performance numbers, the recommendation is to read the data directly from the FPGA through the 10Ge interface.

      Need for an alternate device driver

      The BORPH software approach incurs system-call latencies, which can degrade performance in applications that make frequent short or random accesses to FPGA resources. System calls are function invocations made from user space to request some service from the operating system. Instead of making a series of system calls that involve file I/O, we memory-map the FPGA into the user process address space on the PowerPC. Memory mapping forms an association between the FPGA and the user process memory; in doing so, the abstraction is moved from the kernel into the user application.

      The performance of a memory-mapped FPGA device is measurably better than the current BORPH approach, which presents a file system of hardware-mapped registers. The contribution of a memory-mapped approach is two-fold: first, the overhead of a system call performing I/O operations is eliminated; second, unnecessary memory copies are not kept in the kernel. While the approach gives a performance benefit, it comes with the limitation that user applications are required to track and provide the FPGA symbolic-register-name-to-memory-offset mapping. This limitation can be overcome by automating the mapping at FPGA design compile time, in the same way as is currently done for BORPH, thereby abstracting it away from ordinary users.


      • Latest kernel support (Linux 3.10)
      • mmap method support
      • Improved performance
      • Support for both ROACH and ROACH2 platforms


      • Experimental
      (Send me feedback to help iron out bugs.)


      The linux kernel communicates with the FPGA through special files called "device nodes". There are two device nodes to be created.
      • /dev/roach/config (FPGA configuration)
      • /dev/roach/mem (FPGA read write)
      tcpborphserver3 communicates with the FPGA through these device-specific nodes.
      • telnet ip-address portno
      You can list the available katcp commands by issuing ?help

      Kernel Source

      A working config file is provided in case you struggle to build the kernel image from the source on your own. NOTE: Depending on the platform, use roach or roach2.
        • make 44x/roach_defconfig (for roach2, make 44x/roach2_defconfig)
        • make cuImage.roach (for roach2, make cuImage.roach2)
      (The kernel binary built can be located in arch/powerpc/boot/cuImage.roach) The driver can be found in drivers/char/roach directory.

      Steps to follow

      1. Build the kernel binary from source as indicated above, OR use the provided precompiled kernel binary (uImage-roach-mmap), available after checking out the git repository below
        1. git clone
        2. Note the two files uImage-roach-mmap and test_mmap_RW after checking out.
      2. Run the macro below in U-Boot, assuming you are NFS-booting and have placed the uImage-roach-mmap file in a location from which tftp can fetch it.
        1. setenv roachboot "dhcp;tftpboot 0x2000000 uImage-roach-mmap; setenv bootargs console=ttyS0,115200 root=/dev/nfs ip=dhcp;bootm 0x2000000"
        2. saveenv to save the created macro to flash
        3. run roachboot
      3. Ignore the fatal module dep warning that you see after booting the kernel. After the kernel boots, at the init prompt type the following. Once netbooted, mount the NFS filesystem read-write. If they are not already present, create the device files /dev/roach/config (the bitstream programming interface) and /dev/roach/mem (the memory-mapped read/write interface) using mknod.
        1. cat /proc/devices (to check whether the driver is loaded and see the major number associated with it; you should see major number 252)
        2. mount -o rw,remount /
        3. mkdir /dev/roach
        4. mknod /dev/roach/config c 252 0
        5. mknod /dev/roach/mem c 252 1
        6. mount -o ro,remount /
      4. Use tcpborphserver3, available along with KATCP, which has register-name-to-offset mapping logic.
        1. Issue katcp commands like ?progdev x.bof, ?listdev, ?wordread, and ?wordwrite to communicate with designs.

      Reference userspace code

      The test_mmap_RW.c file, available in the checked-out source code, reads, writes, and verifies the scratchpad register half a million times. The code can be used as a reference and adapted to read data out of BRAMs and send it as UDP packets. Note: the C file has to be cross-compiled for the PowerPC platform; the resulting executable then runs on the PowerPC itself. is the authoritative source for checking the register name and memory offset into FPGA.