CryptoURANUS Economics


Wednesday, October 14, 2020

10-FPGA Programming Methods


10 Ways To Program Your FPGA




6/10/2016 09:47 AM EDT
Despite the recent push toward high level synthesis (HLS), hardware description languages (HDLs) remain king in field programmable gate array (FPGA) development. Specifically, two FPGA design languages have been used by most developers: VHDL and Verilog. Both of these “standard” HDLs emerged in the 1980s, initially intended only to describe and simulate the behavior of the circuit, not implement it.

However, if you can describe and simulate, it’s not long before you want to turn those descriptions into physical gates.
For the last 20-plus years, most designs have been developed in one of these two languages, and some quite nasty and costly language wars have been fought between them. But VHDL and Verilog are not the only options for programming your FPGA. Let's take a look at the other tools we can use.
C / C++ / System C
The C, C++ or System C option allows us to leverage the capabilities of the largest devices while still achieving a semblance of a realistic development schedule... although that may just be my engineering management side coming out.






















The ability to use C-based languages for FPGA design is brought about by HLS (high level synthesis), which had been on the verge of a breakthrough for many years with tools like Handel-C. Recently it has become a reality, with both major vendors, Altera and Xilinx, offering HLS within their toolsets (Spectra-Q and Vivado HLx, respectively).
A number of other C-based implementations are available, such as OpenCL, which is designed for software engineers who want a performance boost from an FPGA without a deep understanding of FPGA design. HLS, by contrast, is still very much aimed at FPGA engineers who want to increase productivity.
As with traditional HDL, HLS imposes limitations: you have to work with a subset of the language. For instance, system calls are difficult to synthesize and implement, and we have to make sure everything is bounded and of a fixed size.
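To make the bounded, fixed-size constraint concrete, here is a minimal sketch of HLS-friendly C (an illustration of the style, not code from any vendor toolkit): a moving-average filter whose window size and loop bounds are fixed at compile time, with no heap allocation and no system calls.

```c
#include <stdint.h>

#define TAPS 4  /* window size fixed at compile time: HLS needs static bounds */

/* Moving-average filter over a fixed window. Everything is statically
   sized and the loop trip count is a constant, so a synthesis tool can
   map the window to a shift register and the sum to a small adder tree. */
uint16_t moving_avg(uint16_t sample, uint16_t window[TAPS])
{
    uint32_t sum = 0;
    for (int i = TAPS - 1; i > 0; i--) {  /* behaves as a shift register */
        window[i] = window[i - 1];
        sum += window[i];
    }
    window[0] = sample;
    sum += sample;
    return (uint16_t)(sum / TAPS);        /* TAPS is a power of two: a shift */
}
```

Fed a constant stream of 8s from an all-zero window, the output ramps 2, 4, 6, 8 as the window fills.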
What is nice about HLS, however, is the ability to develop your algorithms in floating point and let the HLS tool address the floating- to fixed-point conversion.
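What the tool automates can be sketched by hand. The helper names below are illustrative (not any vendor's API); they convert between floating point and the Q15 fixed-point format often used in FPGA datapaths.

```c
#include <stdint.h>

/* Q15 fixed point: 1 sign bit, 15 fractional bits, range [-1, 1). */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;                  /* 2^15 */
    if (scaled >  32767.0f) scaled =  32767.0f;   /* saturate high */
    if (scaled < -32768.0f) scaled = -32768.0f;   /* saturate low  */
    /* round to nearest, ties away from zero */
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}
```

A Q15 multiply then needs a 32-bit product and a 15-bit right shift, which is exactly the kind of bookkeeping an HLS tool takes off your hands.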
As with many things, we are still at the start of the journey: I am sure that over the coming years we will see HLS support more languages and mature to the point where targeting an FPGA feels as routine to a software engineer as writing low-level C.

More info:

10 FPGA dev tools:












  • Page 1: C / C++ / System C
  • Page 2: MyHDL
  • Page 3: CHISEL
  • Page 4: JHDL
  • Page 5: BSV
  • Page 6: MATLAB
  • Page 7: LabVIEW FPGA
  • Page 8: SystemVerilog
  • Page 9: VHDL / VERILOG
  • Page 10: SPINAL HDL 


    fpga4fun.com: where FPGAs are fun

    FPGA software 1 - FPGA design software

    FPGA vendors provide design software that supports their devices. It does four main things:
    • Design-entry.
    • Simulation.
    • Synthesis / place-and-route.
    • Programming through special cables (JTAG).
    There are usually two versions: one free that supports low to medium density FPGA devices, and a full (non-free) version of the same software for big devices.
    The free software is usually fine to start with because it is similar in functionality to the full version, and today's low to medium density devices are very capable.
    Here's a summary of the features/limitations of the software:

    Xilinx's ISE (or the free ISE WebPACK) vs. Altera's Quartus (Pro, Standard, or the free Web/Lite edition):

    • Design entry: ISE: VHDL, Verilog, ABEL, Schematic, EDIF. Quartus: VHDL, Verilog, SystemVerilog, AHDL, Schematic, EDIF.
    • Core generator: ISE: Yes (CORE Generator). Quartus: Yes (MegaWizard Plug-Ins).
    • Functional simulation: No in either (the last Quartus version with integrated simulation was 9.1SP2).
    • Testbench simulation: ISE: use ISim. Quartus: use ModelSim-Altera Starter Edition.
    • Synthesis / place-and-route: the free versions of both are limited to small and medium devices.
    • Programming: Yes for both.
    • FPGA editor: ISE: Yes (FPGA Editor). Quartus: Yes (Chip Editor).
    • Embedded logic analyzer: ISE: ChipScope PRO (a separate product, not free). Quartus: SignalTap II (included in the Quartus II Web/Lite edition).
    • Older versions: available from ISE Classics and the Quartus II Software Archive, respectively.
    • OS support: Windows + Linux for both.
    • Price: the free versions are $0; the full ISE version starts at $2995 for a 12-month license, and the full Quartus version is $2995 for a 12-month license.
    Which is better?
    As of this writing (May 2013), Quartus II is better overall: it runs faster, has a better GUI, has better HDL support, and includes one killer feature, the SignalTap II embedded logic analyzer, which is easy to use and available in the free edition. Altera's low point is simulation: they dropped their own integrated simulator but had nothing to replace it, so they rely on ModelSim for now.
    ISE is pretty good overall. Its low points are basic HDL support and ChipScope PRO (not part of the free suite).
    Xilinx has a new software suite called Vivado, but it is limited to high-end devices.
    Xilinx traditionally had better silicon, and Altera better software... this seems to still hold true.



    OpenCores: EDA Tools

    Introduction

    OpenCores is the world's largest community focused on open-source hardware development. Designing IP cores is unfortunately not as simple as writing a C program: many more steps are needed to verify the cores and to ensure they can be synthesized to different FPGA architectures and various standard-cell libraries.

    Open Source EDA tools

    Plenty of good open-source EDA tools are available, and using them makes it easier to collaborate on the OpenCores site. An IP core that ships with ready-made scripts for an open-source HDL simulator is easier for another person to verify and possibly update. A test environment built for a commercial simulator that only a limited number of people can access makes verification more complicated.

    Icarus Verilog Simulator

    Icarus Verilog is a Verilog simulation and synthesis tool. It operates as a compiler, compiling source code written in Verilog (IEEE-1364) into some target format. For batch simulation, the compiler can generate an intermediate form called vvp assembly, which is executed by the "vvp" command. For synthesis, the compiler generates netlists in the desired format.
    The compiler proper is intended to parse and elaborate design descriptions written to the IEEE standard IEEE Std 1364-2005.
    Icarus web site

    Verilator

    Verilator is a free Verilog HDL simulator. It compiles synthesizable Verilog into an executable format and wraps it into a SystemC model. Internally a two-stage model is used. The resulting model executes about 10 times faster than standalone SystemC.
    Verilator has been used to simulate many very large multi-million-gate designs with thousands of modules. Therefore we have chosen this tool for the verification environment of the OpenRISC processor.
    Verilator web site

    GHDL VHDL simulator

    GHDL implements the VHDL87 (common name for IEEE 1076-1987) standard, the VHDL93 standard (aka IEEE 1076-1993) and the protected types of VHDL00 (aka IEEE 1076a or IEEE 1076-2000). The VHDL version can be selected with a command line option.
    GHDL web site

    EMACS - text editor

    GNU Emacs is an extensible, customizable text editor—and more.
    Very good support for both Verilog HDL and VHDL editing.
    Emacs web site

    Fizzim is a FREE, open-source GUI-based FSM design tool

    The GUI is written in Java for portability. The backend code generation is written in Perl for portability and ease of modification.

    Features:

    GUI:

    • Runs on Windows, Linux, Apple, anything with java.
    • Familiar Windows look-and-feel.
    • Visibility (on/off/only-non-default) and color control on data and comment fields.
    • Multiple pages for complex state machines.
    • "Output to clipboard" makes it easy to pull the state diagram into your documentation.

    Backend:

    • Verilog code generation based on recommendations from experts in the field.
    • Output code has "hand-coded" look-and-feel (no tasks, functions, etc).
    • Switch between highly encoded or onehot output without changing the source.
    • Registered outputs can be specified to be included as state bits, or pulled out as independent flops.
    • Mealy and Moore outputs available.
    • Transition priority available.
    • Automatic grey coding available.
    • Code and/or comments can be inserted at strategic places in the output - no need to "perl" the output to add your copyright or `include
    Fizzim web site
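The automatic gray-coding option listed above relies on a standard bit trick: a Gray code changes only one bit between adjacent states, which avoids multi-bit glitches when a state value crosses a clock domain. A minimal C illustration of the encoding (not Fizzim's actual generator):

```c
#include <stdint.h>

/* Binary-to-Gray: adjacent values differ in exactly one bit. */
uint32_t bin_to_gray(uint32_t b)
{
    return b ^ (b >> 1);
}

/* Gray-to-binary: fold the bits back down with a prefix XOR. */
uint32_t gray_to_bin(uint32_t g)
{
    for (uint32_t shift = 1; shift < 32; shift <<= 1)
        g ^= g >> shift;
    return g;
}
```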

    TCE

    TCE is a toolset for designing application-specific processors (ASP) based on the Transport triggered architecture (TTA). The toolset provides a complete co-design flow from C programs down to synthesizable VHDL and parallel program binaries. Processor customization points include the register files, function units, supported operations, and the interconnection network.
    TCE has been developed internally at the Tampere University of Technology since early 2003. The current source-code base consists of roughly 400,000 lines of C++ code.
    TCE web site

    C to Verilog translation

    An online C-to-Verilog compiler is available. The code generated by the site is licensed under BSD (use it "as is").
    C-to-Verilog web site

    Fedora Electronic Lab

    Fedora Electronic Lab tries to provide a complete hardware design flow with the best open-source tools. We try to ensure interoperability as far as we can, and we work with other open-source developers to improve existing EDA tools.
    Fedora Electronic Lab web site



    The FreeHDL Project


    Linux - The logical choice for EDA


    Subproject Teams:

    AIRE Implementation
    Frontend -parser/analyzer/codegen
    Simulator
    Debugger
    Waveform Viewer
    Testing/compliance


    A project to develop a free, open source, GPL'ed VHDL simulator for Linux!


    Project goals:
    To develop a VHDL simulator that:
    • Has a graphical waveform viewer.
    • Has a source level debugger.
    • Is VHDL-93 compliant.
    • Is of commercial quality. (on par with, say, V-System - it'll take us a while to get there, but that should be our aim)
    • Is freely distributable - both source and binaries - like Linux itself. (Under the Gnu General Public License (GPL)).
    • Works with Linux. If others want to port it to other platforms they may, but it is not the goal of this project.
    News:
    FreeHDL is used by Qucs for digital simulation. Qucs is a circuit simulator with graphical user interface. Qucs aims to support all kinds of circuit simulation types, e.g. DC, AC, S-parameter, Transient, Noise and Harmonic Balance analysis. It is available from http://sourceforge.net/projects/qucs.
    Download:
    Release 0.0.7 of the FreeHDL compiler/simulator system can be downloaded from here.
    Release 0.0.6 of the FreeHDL compiler/simulator system can be downloaded from here.
    Release 0.0.5 of the FreeHDL compiler/simulator system can be downloaded from here.








































































































    XCV7T2000T/X690T/HTG700 Monero Bitstreams



    Xilinx Virtex-7 V2000T-X690T-HTG700













    Product Description:

    Powered by Xilinx Virtex-7 V2000T, V585, or X690T the HTG700 is ideal for ASIC/SOC prototyping, high-performance computing, high-end image processing, PCI Express Gen 2 & 3 development, general purpose FPGA development, and/or applications requiring high speed serial transceivers (up to 12.5Gbps).




     Key Features and Benefits:

    • Scalable via HTG-FMC-FPGA module (with one X980T FPGA) for higher FPGA gate density
    • x8 PCI Express Gen2 /Gen 3 edge connectors
    • x3 FPGA Mezzanine Connectors (FMC)
    • x4 SMA ports (16 SMAs providing 4 Txn/Txp/Rxn/Rxp) clocked by external pulse generators
    • DDR3 SODIMM with support for up to 8GB (shipped with a 1GB module)
    • USB to UART bridge
    • Configuration through JTAG or Micron G18 Flash

     

     What's Included:

    • Reference Designs
    • Schematic, User Manual, UCF
    • The HTG700 Board


    HTG-700: Xilinx Virtex™ -7 PCI Express  Development Platform
     



    Three High Pin Count (HPC) FMC connectors provide access to 480 single-ended I/Os and 24 high-speed serial transceivers of the on-board Virtex 7 FPGA. The availability of over 100 different off-the-shelf FMC modules extends the functionality of the board for a variety of different applications.


    Eight lanes of PCI Express Gen 2 are supported by hard-coded controllers inside the Virtex 7 FPGA. The board's layout, the performance of the Virtex 7 FPGA fabric, the high-speed serial transceivers (used for the PHY interface), and a flexible on-board clock/jitter attenuator, along with a soft PCI Express Gen 3 IP core, allow use of the board for PCI Express Gen 3 applications.

    The HTG-700 Virtex 7 FPGA board can be used either in PCI Express mode (plugged into host PC/Server) or stand alone mode (powered by external ATX or wall power supply).




    Features:
    Xilinx Virtex-7 V2000T, 585T, or X690T FPGA
    Scalable via HTG-777 FPGA module  for providing higher FPGA gate density
    x8 PCI Express Gen2 /Gen 3  edge connectors with jitter cleaner chip
      - Gen 3: with the -690 option
      - Gen 2: with the -585 or -2000 option (Gen3 requires soft IP core)
    x3 FPGA Mezzanine Connectors (FMC)
      - FMC #1: 80 LVDS (160 single-ended) I/Os and 8 GTX (12.5 Gbps) Serial
        Transceivers
      - FMC #2: 80 LVDS (160 single-ended) I/Os and 8 GTX (12.5 Gbps) Serial
        Transceivers
      - FMC #3: 80 LVDS (160 single-ended) I/Os and 8 GTX (12.5 Gbps) Serial
        Transceivers. Physical location of this  connector allows  plug-in FMC
        daughter cards having easy access to the board through the front panel.
    x4 SMA ports (16 SMAs providing 4 Txn/Txp/Rxn/Rxp) clocked by external pulse generators
    DDR3 SODIMM with support for up to 8GB (shipped with a 2GB module)
    Programmable oscillators (Silicon Labs Si570) for different interfaces
    Configuration through JTAG or Micron G18 Embedded Flash
    USB to UART bridge
    ATX and DC power supplies for PCI Express and Stand Alone operations
    LEDs & Pushbuttons
    Size: 9.5" x 4.25"
    Kit Content:

    Hardware:
    -
    HTG-700 board

    Software:

    - PCI Express Drivers (evaluation) for Windows & Linux

    Reference Designs/Demos:
    - PCI Express Gen3 PIO
    - 10G & 40G Ethernet (available only if interested in licensing the IP cores)
    - DDR3 Memory Controller

    Documents:

    - User Manual
    - Schematics (in searchable .pdf format)
    - User Constraint File (UCF)

    Ordering Information / Part Numbers:
    - HTG-V7-PCIE-2000-2 (populated with V2000T-2 FPGA). Price: Contact us
    - HTG-V7-PCIE-690-2 (populated with X690T-2 FPGA). Price: Contact us
    - HTG-V7-PCIE-690-3 (populated with X690T-3 FPGA). Price: Contact us
    - HTG-V7-PCIE-585-2 (populated with V585T-2 FPGA). Price: Contact us



    FPGA-(CVP-13, XUPVV4): Hardware Modifications





    Currently (as of 8 October 2018), the Bittware cards (CVP-13, XUPVV4) do not require any modifications and will run at full speed out of the box.

    If you have a VCU1525 or BCU1525, you should acquire a DC1613A USB dongle to change the core voltage.

    This dongle requires modifications to ‘fit’ into the connector on the VCU1525 or BCU1525.

    You can make the modifications yourself as described here,

    You can purchase a fully modified DC1613A from https://shop.fpga.guide.

    If you have an Avnet AES-KU040 and you are brave enough to make the complex modifications to run at full hash rate, you can download the modification guide right here (it will be online in a few days).

    You can see a video of the modded card on YouTube.

    If you have a VCU1525 or BCU1525, we recommend using the TUL Water Block (this water block was designed by TUL, the company that designed the VCU/BCU cards).

    The water block can be purchased from https://shop.fpga.guide.



    WARNING:  Installation of the water block requires a full disassembly of the FPGA card which may void your warranty.

    Maximum hash rate (even beyond water-cooling) is achieved by immersion cooling, immersing the card in a non-conductive fluid.

    Engineering Fluids makes BC-888 and EC-100 fluids which are non-boiling and easy to use at home. You can buy them here.

    If you have a stock VCU1525, there is a danger of the power regulators failing from overheating, even if the FPGA is very cool.

     We recommend a simple modification to cool the power regulators by more than 10C.


    The modification is very simple: you need a Slim X3 CPU cooler, thermal tape, and a fan controller.

    First, cut a piece of thermal tape, apply it to the back side of the Slim X3 CPU cooler, and plug the fan into the fan controller:



    Then, you are going to stick the CPU cooler on the back plate of the VCU1525 on this area:


    Once done it will look like this:


    Make sure to connect the fan controller to the power supply and run the fan on maximum speed.

    This modification will cool the regulators on the back side of the VCU1525, dropping their temperature by more than 10C and extending the life of your hardware.

    This modification is not needed on ‘newer’ versions of the hardware such as the XBB1525 or BCU1525.





    Source-Page:
    Grab Bag of FPGA and GPU Software Tools from Intel, Xilinx & NVIDIA



    FPGAs as Accelerators:
    • From the Intel® FPGA SDK for OpenCL™ Product Brief available at link.
    • "The FPGA is designed to create custom hardware with each instruction being accelerated providing more efficiency use of the hardware than that of the CPU or GPU architecture would allow." 


    Hardware:
    • With Intel, developers can utilize an x86 with a built-in FPGA or connect a card with an Intel or Xilinx FPGA to an x86. This Host + FPGA Acceleration would typically be used in a "server."
    • With Intel and Xilinx, developers can also get a chip with an ARM core + FPGA. This FPGA + ARM SoC Acceleration is typically used in embedded systems.
    • Developers can also connect a GPU card from Nvidia to an x86 host. Developers can also get an integrated GPU from Intel. Nvidia also provides chips with ARM cores + GPUs.

    Tools:
    • Intel and Xilinx provide tools to help developers accelerate x86 code execution using an FPGA attached to the x86. They also provide tools to accelerate ARM code execution using an FPGA attached to the ARM.
    • Intel, Xilinx and Nvidia all provide OpenCL libraries to access their hardware. These libraries cannot interoperate with one another. Intel also provides libraries to support OpenMP, and Nvidia provides CUDA for programming their GPUs. Xilinx ships its OpenCL library in two SDKs, SDAccel and SDSoC: SDAccel is used for x86 + Xilinx FPGA systems, i.e. servers; SDSoC is used for Xilinx chips with ARM + FPGA, i.e. embedded systems.

    Libraries:
    • To help developers build computer-vision applications, Xilinx provides OpenVX, Caffe, OpenCV, and various DNN and CNN libraries in an SDK called reVISION, for software running on chips with an ARM + FPGA.
    • All of these libraries and many more are available for x86 systems.
    • Xilinx also provides function-accelerator libraries for neural-network inference, HEVC decoding and encoding, and SQL data movement.

    Tools for FPGA + ARM SoC Acceleration:

    Intel:
    • Developers can work with ARM SoCs from Intel using:
    • ARM DS-5 for debug
    • SoC FPGA Embedded Development Suite for embedded software development tools
    • Intel® Quartus® Prime Software for working with the programmable logic
    • Virtual Platform for simulating the ARM
    • SoC Linux for running Linux on the FPGA + ARM SoC
    • Higher Level
    • Intel® FPGA SDK for OpenCL™ is available for programming the ARM + FPGA chips using OpenCL.

    Xilinx:
    • Developers can work with ARM SoCs from Xilinx using:
    • An SDK for application development and debug
    • PetaLinux Tools for Linux development and ARM simulation and
    • Vivado for working with the programmable logic (PL) on its FPGA + ARM SoC chips


    Higher Level:
    • Xilinx provides SDSoC for accelerating ARM applications on the built-in FPGA. Users can program in C and/or C++ and SDSoC will automatically partition the algorithm between the ARM core and the FPGA. Developers can also program using OpenCL and SDSoC will link in an embedded OpenCL library and build the resulting ARM+FPGA system. SDSoC also supports debugging and profiling.


    Domain Specific:
    Xilinx leverages SDSoC to create an embedded vision stack called reVISION.




    Source-Website:

    SPINPACK


    Author:
    Joerg Schulenburg, Uni-Magdeburg, 2008-2016


    What is SpinPack?

    SPINPACK is a big program package to compute lowest eigenvalues and eigenstates and various expectation values (spin correlations etc) for quantum spin systems.



    These model systems can for example describe magnetic properties of insulators at very low temperatures (T=0) where the magnetic moments of the particles form entangled quantum states.


    The package generates the symmetrized configuration vector and the sparse matrix representing the quantum interactions, computes its eigenvectors, and finally computes some expectation values for the system.

    The first SPINPACK version was based on Nishimori's TITPACK (Lanczos method, no symmetries), but it was soon converted to C/C++ and completely rewritten (1994/1995).

    Other diagonalization algorithms are implemented too (Lanczos, 2x2 diagonalization, and LAPACK/BLAS for smaller systems). The package can handle Heisenberg, t-J, and Hubbard systems up to 64 sites or more using special compiler and CPU features (usually up to 128), or even more sites in a slower emulation mode (C++ required).

    For instance, we obtained the lowest eigenstates for the Heisenberg Hamiltonian on a 40-site square lattice on our machines in 2002. Note that the resources needed for the computation grow exponentially with the system size.

    The package is written mainly in C so that it runs on all Unix systems; C++ is only needed for complex eigenvectors and twisted boundary conditions when C has no complex extension. This makes the package very portable.

    Parallelization can be done using the MPI and PTHREAD libraries. Mixed (hybrid) mode is possible, but not always faster than pure MPI (2015); v2.60 has a slight hybrid-mode advantage on CPUs supporting hyper-threading.

    This will hopefully be improved further. MPI scaling has been tested up to 6000 cores and PTHREAD scaling up to 510 cores, but the latter requires careful tuning (scaling measured 2008-2016).

    The program can use all topological symmetries, S(z) symmetry, and spin inversion to reduce the matrix size. This reduces the needed computing resources by a linear factor.

    Since 2015/2016, CPU vector extensions (SIMD: SSE2, AVX2) are supported to get better performance when doing symmetry operations on bit representations of the quantum spins.

    The results are very reliable because the package has been used in scientific work since 1995. A low-latency, high-bandwidth network and low-latency memory are needed to get the best performance on large-scale clusters.

    News:

    • Groundstate of the S=1/2 Heisenberg AFM on a N=42 kagome biggest sub-matrix computed (Sz=1 k=Pi/7 size=36.7e9, nnz=41.59, v2.56 cplx8, using partly non-blocking hybrid code on supermuc.phase1 10400cores(650 nodes, 2 tasks/node, 8cores/task, 2hyperthreads/core, 4h), matrix_storage=0.964e6nz/s/core SpMV=6.58e6nz/s/core Feb2017)
    • Groundstate of the S=1/2 Heisenberg AFM on a N=42 linear chain computed (E0/Nw=-0.22180752, Hsize = 3.2e9, v2.38, Jan2009) using 900 Nodes of a SiCortex SC5832 700MHz 4GB RAM/Node (320min).
      Update: N=41 Hsize = 6.6e9, E0/Nw=-0.22107343 16*(16cores+256GB+IB)*32h matrix stored, v2.41 Oct2011).
    • Groundstate of the S=1/2 Heisenberg AFM on a N=42 square lattice computed (E0 = -28.43433834, Hsize = 1602437797, ((7,3),(0,6)), v2.34, Apr2008) using 23 Nodes a 2*DualOpteron-2.2GHz 4GB RAM via 1Gb-eth (92Cores usage=80%, ca.60GB RAM, 80MB/s BW, 250h/100It).
    • Program is ready for cluster (MPI and Pthread can be used at the same time, see the performance graphic) and can again use memory as storage media for performance measurement (Dec07).
    • Groundstate of the S=1/2 Heisenberg AFM on a N=40 square lattice computed (E0 = -27.09485025, Hsize = 430909650, v1.9.3, Jan2002).
    • Groundstate of the S=1/2 J1-J2-Heisenberg AFM on a N=40 square lattice J2=0.5, zero-momentum space: E0= -19.96304839, Hsize = 430909650 (15GB memory, 185GB disk, v2.23, 60 iterations, 210h, Altix-330 IA64-1.5GHz, 2 CPUs, GCC-3.3, Jan06)
    • Groundstate of the S=1/2 Heisenberg AFM on a N=39 triangular lattice computed (E0 = -21.7060606, Hsize = 589088346, v2.19, Jan2004).
    • Largest complex Matrix: Hsize=1.2e9 (26GB memory, 288GB disk, v2.19 Jul2003), 90 iterations: 374h alpha-1GHz (with limited disk data rate, 4 CPUs, til4_36)
    • Largest real Matrix: Hsize=1.3e9 (18GB memory, 259GB disk, v2.21 Apr2004), 90 iterations: real=40h cpu=127h sys=9% alpha-1.15GHz (8 CPUs, til9_42z7) 

      Download:

      Verify download using: gpg --verify spinpack-2.55.tgz.asc spinpack-2.54.tgz

       

       Installation:

      • gunzip -c spinpack-xxx.tgz | tar -xf - # xxx is the version number
      • cd spinpack; ./configure --mpt
      • make test # to test the package and create exe path
      • # edit src/config.h exe/daten.def for your needs (see models/*.c)
      • make
      • cd exe; ./spin

       

      Documentation:

      The documentation is available in the doc path. Most parts of the documentation have now been rewritten in English.

      If you still find parts written in German, or out-of-date documentation, send me an email with a short hint about where to find that part, and I will rewrite it as soon as I can.

      Please see doc/history.html for the latest changes. You can find documentation about speed in the package, or an older version on this spinpack-speed page.


      Most Important Function:

      The most time-consuming important function is b_smallest in hilbert.c. This function computes the representative of a set of symmetric spin configurations (bit patterns) from any member of that set.

      It also returns a phase factor and the orbit length. It would be great progress if the performance of this function could be improved; ideas are welcome.
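The idea can be sketched in plain C. For pure translational symmetry on a ring of spins, the representative of a configuration is the lexicographically smallest of its cyclic rotations, and the orbit length is the number of distinct rotations. This is only an illustration of the concept with a made-up NSITES; spinpack's actual b_smallest also handles point-group symmetries and the phase factor.

```c
#include <stdint.h>

#define NSITES 8  /* illustrative ring size, not spinpack's configuration */

/* Rotate an NSITES-bit spin pattern left by one site. */
static uint32_t rot1(uint32_t s)
{
    return ((s << 1) | (s >> (NSITES - 1))) & ((1u << NSITES) - 1);
}

/* Return the smallest cyclic rotation of s (the orbit representative)
   and store the orbit length, i.e. the number of distinct rotations. */
uint32_t representative(uint32_t s, int *orbit_len)
{
    uint32_t best = s, cur = s;
    int len = 1;
    for (int i = 1; i < NSITES; i++) {
        cur = rot1(cur);
        if (cur == s)        /* orbit closed early: pattern is periodic */
            break;
        len++;
        if (cur < best)
            best = cur;
    }
    *orbit_len = len;
    return best;
}
```

Only one matrix row per orbit needs to be stored, which is where the linear reduction in resources mentioned above comes from.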

      One of my motivations for using FPGAs in 2009 was the FPGA/VHDL compiler.

      At the time, the Xilinx tools were so slow, badly scaling, and buggy that code generation and debugging was really no fun, and a much better FPGA toolchain was needed for HPC; all of that has since been fixed by updates.

      In May 2015 I added a software Benes network to gain from AVX2, but it looks like this is still not the maximum available speed (hyper-threading shows a factor of nearly 2; does the bitmask fall out of the L1 cache?).

      Examples for open access

      Please use these data for your work, or verify my data. Questions and corrections are welcome. If you miss data or explanations here, please send me a note.

       

      Frequently asked questions (FAQ):

       Q: I try to diagonalize a 4-spin system, but I do not get the full spectrum. Why?
       A: Spinpack is designed to handle big systems. Therefore it uses as many
          symmetries as it can. The very small 4-spin system has a very special
          symmetry which makes it equivalent to a 2-spin system built from two
          s=1 spins. Spinpack uses this symmetry automatically to give you the
          possibility to emulate s=1 (or s=3/2, etc.) spin systems by pairs of
          s=1/2 spins. If you want to switch this off, edit src/config.h and
          change CONFIG_S1SYM to CONFIG_NOS1SYM.
      


      This picture shows a small sample of a possible Hilbert matrix. The non-zero elements are shown as black pixels (v2.33, Feb 2008, kago36z14j2).

      Hilbert matrix N=36 s=1/2 kago lattice


      This picture shows a small sample of a possible Hilbert matrix. The non-zero elements are shown as black (J1) and gray (J2) pixels (v2.42, Nov 2011, j1j2-chain N=18 Sz=0 k=0). The configuration space is sorted by J1-Ising-model energy to show the structure of the matrix; Ising energy ranges are shown as slightly grayed areas.

      Hilbert matrix for N=18 s=1/2 quantum chain


      Ground-state energy scaling for finite-size spin-1/2 AFM chains, N=4..40, using up to 300GB of memory to store the N=39 sparse matrix and 245 CPU-hours (2011, src=lc.gpl).
      ground state s=1/2-AFM-LC
      
      





      XC7V2000T-x690T_HTG-700: Models

      FPGA Schematic Updates: Cryptocurrency

      Reference Materials

      FPGA Schematic Updates


      FPGA Device Driver Memo

      Contents


      FPGA device driver (Memory Mapped Kernel)



      Description

      A simple Linux device driver for FPGA access. This driver provides memory-mapped support and can communicate with FPGA designs. The advantage of memory mapping is that the system-call overhead is eliminated completely; however, network overhead and the EPB bus bandwidth limitation remain. The PowerPC in ROACH is mainly intended for control and monitoring; for larger transactions and better performance, the recommendation is to read data directly from the FPGA through the 10GbE interface.

      Need for an alternate device driver

      The BORPH software approach incurs system-call latencies, which can degrade performance in applications that make frequent short or random accesses to FPGA resources. System calls are function invocations made from user space to request some service from the operating system. Instead of making a series of system calls that involve file I/O, we memory-map the FPGA into the user process address space on the PowerPC. Memory mapping forms an association between the FPGA and the user process memory; in doing so, the abstraction is moved from the kernel to the user application. The performance of a memory-mapped FPGA device is measurably better than the current BORPH approach, which presents a file system of hardware-mapped registers. The contribution of a memory-mapped approach is two-fold: first, the overhead of a system call performing I/O operations is eliminated; second, unnecessary memory copies are not kept in the kernel. While the approach gives a performance benefit, it comes with the limitation that user applications are required to track and provide the mapping from FPGA symbolic register names to memory offsets. This limitation can be overcome by automating the mapping at FPGA design compile time, in the same way as is currently done for BORPH, thereby abstracting it away from ordinary users.
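The pattern described above can be sketched in C. For illustration the sketch maps an ordinary file; on the real system the path would be the FPGA device node (/dev/roach/mem) and each access after mmap() would hit hardware registers directly, with no system call per access. The page size and register offset are illustrative.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a one-page register window and read a 32-bit register by byte
   offset. After mmap(), each register access is a plain load/store,
   which is the performance point made above. */
uint32_t read_reg(const char *path, off_t offset)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 0; }

    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the fd is closed */
    if (regs == MAP_FAILED) { perror("mmap"); return 0; }

    uint32_t value = regs[offset / 4];  /* plain memory load, no syscall */
    munmap((void *)regs, 4096);
    return value;
}
```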

      Advantages

      • Latest kernel support (Linux 3.10)
      • mmap method support
      • Improved performance
      • Support for both ROACH and ROACH2 platforms

      Implications

      • Experimental
      (Send feedback to shanly@ska.ac.za to help iron out bugs)

      Usage

      The Linux kernel communicates with the FPGA through special files called "device nodes". There are two device nodes to be created:
      • /dev/roach/config (FPGA configuration)
      • /dev/roach/mem (FPGA read/write)
      tcpborphserver3 communicates with the FPGA through these device nodes.
      • telnet ip-address portno
      You can list the available katcp commands by issuing ?help

      Kernel Source

      There is a working config file in case you struggle to build the kernel image from the source on your own. NOTE: Depending on platform, use roach or roach2.
        • make 44x/roach_defconfig (for roach2, make 44x/roach2_defconfig)
        • make cuImage.roach (for roach2, make cuImage.roach2)
      (The built kernel binary is located in arch/powerpc/boot/cuImage.roach.) The driver can be found in the drivers/char/roach directory.

      Steps to follow

      1. Build the kernel binary from source as indicated above, OR use the provided precompiled kernel binary (uImage-roach-mmap) available after checking out the git repository below:
        1. git clone https://github.com/shanlyrajan/roach2_linux
        2. Note the two files uImage-roach-mmap and test_mmap_RW after checking out.
      2. Run the macro below in U-Boot, assuming you are NFS booting and have placed the uImage-roach-mmap file in a location from which TFTP can fetch it.
        1. setenv roachboot "dhcp;tftpboot 0x2000000 uImage-roach-mmap; setenv bootargs console=ttyS0,115200 root=/dev/nfs ip=dhcp;bootm 0x2000000"
        2. saveenv to save the created macro to flash
        3. run roachboot
      3. Ignore the fatal module dep warning that appears after booting the kernel. Once the kernel boots to the init prompt and the system is netbooted, remount the NFS filesystem read-write. If not already present, use mknod to create the device files: /dev/roach/config (the bitstream programming interface) and /dev/roach/mem (the memory-mapped read/write interface).
        1. cat /proc/devices (to verify the driver is loaded and check the major number associated with it; you should see major number 252)
        2. mount -o rw,remount /
        3. mkdir /dev/roach
        4. mknod /dev/roach/config c 252 0
        5. mknod /dev/roach/mem c 252 1
        6. mount -o ro,remount /
      4. Use tcpborphserver3 available along with KATCP which has registername to offset support logic.
        1. Issue katcp commands like ?progdev x.bof, ?listdev, ?wordread and ?wordwrite for communicating with designs.

      Reference userspace code

      The test_mmap_RW.c file, available in the checked-out source code, reads, writes, and verifies the scratchpad register half a million times. The code can be used as a reference and adapted to read data out of BRAMs and send it as UDP packets. Note: the C file has to be cross-compiled for the PowerPC platform; the resulting executable then runs on the PowerPC itself. Core_info.tab is the authoritative source for mapping register names to memory offsets into the FPGA.
       

      FPGA Heterogeneous Self-Healing


      FPGA Autonomous Acceleration Self-Healing



      This example uses FPGA-in-the-Loop (FIL) simulation to accelerate a video processing simulation with Simulink® by adding an FPGA. The process shown analyzes a simple system that sharpens an RGB video input at 24 frames per second.
      This example uses the Computer Vision System Toolbox™ in conjunction with Simulink® HDL Coder™ and HDL Verifier™ to show a design workflow for implementing FIL simulation.













      Products required to run this example:
      • MATLAB
      • Simulink
      • Fixed-Point Designer
      • DSP System Toolbox
      • Computer Vision System Toolbox
      • HDL Verifier
      • HDL Coder
      • FPGA design software (Xilinx® ISE® or Vivado® design suite or Intel® Quartus® Prime design software)
      • One of the supported FPGA development boards and accessories (the ML403, SP601, BeMicro SDK, and Cyclone III Starter Kit boards are not supported for this example)
      • For connection using Ethernet: Gigabit Ethernet Adapter installed on host computer, Gigabit Ethernet crossover cable
      • For connection using JTAG: USB Blaster I or II cable and driver for Altera FPGA boards. Digilent® JTAG cable and driver for Xilinx FPGA boards.
      • For connection using PCI Express®: FPGA board installed into PCI Express slot of host computer.
      MATLAB® and FPGA design software can either be locally installed on your computer or on a network accessible device. If you use software from the network you will need a second network adapter installed in your computer to provide a private network to the FPGA development board. Consult the hardware and networking guides for your computer to learn how to install the network adapter.
      Note: The demonstration includes code generation. Simulink does not permit you to modify the MATLAB installation area. If necessary, change to a working directory that is not in the MATLAB installation area prior to starting this example.

      1. Open and Execute the Simulink Model

      Open the fil_videosharp_sim.mdl and run the simulation for 0.21s.

      Due to the large quantity of data to process, the simulation runs slowly. We will improve the simulation speed in the following steps by using FPGA-in-the-Loop.

      2. Generate HDL Code

      Generate HDL code for the Streaming Video Sharpening subsystem by performing these steps:
      a. Right-click on the block labeled Streaming 2-D FIR Filter.
      b. Select HDL Code Generation > Generate HDL for Subsystem in the context menu.
      Alternatively, you can generate HDL code by entering the following command at the MATLAB prompt:
      >> makehdl('fil_videosharp_sim/Streaming 2-D FIR Filter')
      If you do not want to generate HDL code, you can copy pre-generated HDL files to the current directory using this command:
      >> copyFILDemoFiles('videosharp');

      3. Set Up FPGA Design Software

      Before using FPGA-in-the-Loop, make sure your system environment is set up properly for accessing FPGA design software. You can use the function hdlsetuptoolpath to add ISE or Quartus II to the system path for the current MATLAB session.
      For Xilinx FPGA boards, run
      hdlsetuptoolpath('ToolName', 'Xilinx ISE', 'ToolPath', 'C:\Xilinx\13.1\ISE_DS\ISE\bin\nt64\ise.exe');
      This example assumes that the Xilinx ISE executable is C:\Xilinx\13.1\ISE_DS\ISE\bin\nt64\ise.exe. Substitute with your actual executable if it is different.
      For Altera boards, run
      hdlsetuptoolpath('ToolName','Altera Quartus II','ToolPath','C:\altera\11.0\quartus\bin\quartus.exe');
      This example assumes that the Altera Quartus II executable is C:\altera\11.0\quartus\bin\quartus.exe. Substitute with your actual executable if it is different.

      4. Run FPGA-in-the-Loop Wizard

      To launch the FIL Wizard, select Tools > Verification Wizards > FPGA-in-the-Loop (FIL)... in the model window or enter the following command at the MATLAB prompt:
      >> filWizard;

      4.1 Hardware Options

      Select a board in the board list.

      4.2 Source Files

      a. Add the previously generated HDL source files for the Streaming Video Sharpening subsystem.
      b. Select Streaming_2_D_FIR_Filter.vhd as the Top-level file.

      4.3 DUT I/O Ports

      Do not change anything in this view.

      4.4 Build Options

      a. Select an output folder.
      b. Click Build to build the FIL block and the FPGA programming file.
      During the build process, the following actions occur:
      • A FIL block named Streaming_2_D_FIR_Filter is generated in a new model. Do not close this model.
      • After new model generation, the FIL Wizard opens a command window where the FPGA design software performs synthesis, fit, place-and-route, timing analysis, and FPGA programming file generation. When the FPGA design software process is finished, a message in the command window lets you know you can close the window. Close the window.
      c. Close the fil_videosharp_sim model.

      5. Open and Complete the Simulink Model for FIL

      a. Open fil_videosharp_fpga.slx.
      b. Copy the previously generated FIL block into fil_videosharp_fpga.slx, where it says "Replace this with FIL block".

      6. Configure FIL Block

      a. Double-click the FIL block in the Streaming Video Sharpening with FPGA-in-the-Loop model to open the block mask.
      b. Click Load.
      c. Click OK to close the block mask.

      7. Run FIL Simulation

      Run the simulation for 10s and observe the performance improvement.

      This concludes the Video Processing Acceleration using FPGA-In-the-Loop example.

      FPGA Monero Working IP-Cores Shares











      Download - Click Here - SiaFpgaMiner

      This project is a VHDL FPGA core that implements an optimized Blake2b pipeline to mine Siacoin.

      Motivation

      When CPU mining got crowded in the early years of cryptocurrencies, many started mining Bitcoin with FPGAs. Then the time arrived when it made sense to invest millions in ASIC development; ASICs outperformed FPGAs by several orders of magnitude, kicking them out of the game. The complexity and cost of developing ASICs monopolized Bitcoin mining, leading to relatively dangerous mining centralization. Therefore, emerging altcoins decided to base their PoW puzzles on algorithms that wouldn't give ASICs an unfair advantage (i.e. ASIC-resistant ones). The most popular mechanism has been designing the algorithm to be memory-hard (i.e. dependent on memory accesses), which makes memory bandwidth the computing bottleneck. This gives GPUs an edge over ASICs, effectively democratizing access to mining hardware, since GPUs are consumer electronics. Ethereum, with its Ethash PoW algorithm, is a clear example.
      Siacoin is an example of a coin without a memory-hard PoW algorithm, and when this project started it had no ASIC miners (some are now being rolled out; see Obelisk and Antminer A3). So it was a perfect candidate for FPGA mining! (more for fun than profit)

      Design theory

      To yield the highest possible hash rate, a fully unrolled pipeline was implemented, with resources dedicated to every operation of every round of the Blake2b hash computation. It takes 96 clock cycles to fill the pipeline and start getting valid results (4 clocks per 'G' x 2 'G' per round x 12 rounds).
      • MixG.vhd implements the basic 'G' function in 4 steps. Eight- and two-step variations were explored, but four steps gave the best balance between resource usage and timing.
      • QuadG.vhd is just a wrapper that instantiates 4 MixG components to process the full 16-word vectors and make the higher-level files easier to understand.
      • Blake2bMinerCore.vhd instantiates the MixG components for all rounds and wires their inputs and outputs appropriately. The nonce generation and distribution logic also lives in this file.
      • /Example contains an example instantiation of Blake2bMinerCore interfacing with a host via UART. It includes a very minimalist Python script to interface the FPGA to a Sia node for mining.

      MixG

      The diagram below shows the pipeline structure of a single MixG. Four of these are instantiated in parallel to constitute a QuadG, and QuadGs are chained in series to form rounds.
      MixG logic
      The gray A, B, C, D boxes contain combinatorial operations to add and rotate bits according to the G function specification. The white two-cell boxes represent two 64-bit pipelining registers that store results from the combinatorial logic for use later in the process.

      Nonce Generation and Distribution

      Pipelining the hash vector throughout the chain implies heavy register usage, and there is no way around it. Fortunately, the X/Y message feeds aren't as resource-demanding, because the work header can remain constant for a given block period, with the exception of the nonce field, which must obviously change all the time to yield unique hashes. Therefore, the nonce field must be tracked or kept in memory for when a given step in the mixing logic requires it. The most simplistic approach would be to make a huge N-bit-wide shift register to "drag" the nonce corresponding to each clock cycle across the pipeline. This is not an ideal solution, for we would require N flip-flops (e.g. a 48-bit counter) times the number of clock cycles it takes to cross the pipeline (48 x 96 = 4608 FFs!)
      Luckily, the nonce field is only used once per round (12 times total). This allows hooking up 12 counters statically to the X or Y input where the nonce part of the message is fed in each round. To make each counter output the value of the nonce corresponding to a given cycle, the counters' initial values are offset by the number of clock cycles between them. The following diagram illustrates the point:
      Nonce counters
      In this case the offsets show that the nonce used in round zero will be consumed by round one 8 clock cycles later, by round two 20 cycles later, and so on. (The distance in clock cycles between counters is defined by the Blake2b message schedule.)
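      As a sanity check of the counter-offset idea, here is a small C model. It is not from the repository, and the delay values 8 and 20 are the illustrative offsets mentioned above, not the real Blake2b message-schedule distances:

```c
#include <stdint.h>

/* Pipeline delay, in clock cycles, between round 0 consuming a nonce and
   each later round consuming that same nonce (illustrative values only). */
static const int delay[3] = {0, 8, 20};

/* Nonce seen by round r at absolute clock cycle `cycle`: a free-running
   counter whose initial value is offset by -delay[r] reads exactly the
   nonce that round r must consume on that cycle. */
int64_t round_nonce(int r, int64_t cycle)
{
    return cycle - delay[r];
}
```

      A nonce injected into round 0 at cycle t reaches round 1 at cycle t + 8 and round 2 at cycle t + 20; the offset counters reproduce it at those points without any shift register.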

      Implementation results

      It is evident that a single core is too big to fit in a regular affordable FPGA device. A ballpark estimate of the flip-flop resources a single core could use:
      • 64 bits per word x 16 word registers per MixG x 4 MixG per QuadG x 2 QuadG per round x 12 rounds = 98,304 registers (not counting nonce counters and other pieces of logic).
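      The resource arithmetic above, together with the pipeline-latency and shift-register figures from the previous sections, can be recomputed in a few lines of C (just the article's numbers, nothing from the repository):

```c
/* Recomputing the article's back-of-envelope figures. */

/* Pipeline fill latency: 4 clocks per G x 2 G per round x 12 rounds. */
enum { FILL_CYCLES = 4 * 2 * 12 };                  /* = 96 cycles */

/* Naive nonce shift register: a 48-bit value dragged across the pipeline. */
enum { NAIVE_NONCE_FF = 48 * FILL_CYCLES };         /* = 4608 FFs */

/* Hash-state pipelining: 64 bits x 16 words x 4 MixG x 2 QuadG x 12 rounds. */
static const long STATE_FF = 64L * 16 * 4 * 2 * 12; /* = 98,304 FFs */
```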
      The design won't fit in your regular Spartan 6 dev board, which is why I built it for a Kintex 7 410 FPGA. Here are some of my compile tests:
      Cores | Clock (MHz) | Hashrate (MH/s) | Mix steps | Strategy | Utilization | Worst setup slack | Worst hold slack | Failures | Notes
      1 | 200 | 200 | 4 | Default | 18.00% | 0.168 | | 0 |
      2 | 200 | 400 | 4 | Default | 38.00% | | | 0 |
      3 | 200 | 600 | 4 | Default | 56.00% | -0.246 | | | 602 failing endpoints
      3 | 200 | 600 | 4 | Explore | 56.00% | -0.246 | 0.011 | | 602 failing endpoints
      3 | 166.67 | 500.01 | 4 | Default | 56.00% | 0.132 | 0.02 | 0 |
      4 | 166.67 | 666.68 | 4 | Default | 75.00% | 0.051 | 0.009 | 0 |
      5 | 166 | 830 | 4 | Explore | | | | | Placing error
      4 | 173.33 | 693.32 | 4 | Explore | 75.00% | 0.039 | 0 | 0 |
      4 | 173.33 | 693.32 | 4 | Explore | 75.00% | 0.17 | 0.022 | 0 | 1 BUFGs per core
      As seen in the table, the highest number of cores I was able to instantiate was 4, and the highest clock frequency that met timing was 173.33 MHz.
      ~700 MH/s is no better than a mediocre GPU, but power draw is way less! (hey, I did say it was for fun)

      Further work

      • Investigate BRAM as alternative to flip-flops (unlikely to fit the needs of this application).
      • Fine-tune a higher clock frequency to squeeze out a few more MH/s.
      • Port to Blake-256 for Decred mining. That variant adds two rounds, but the words are half as wide, so fitting ~2x the number of cores sounds possible.
      • Do more in-depth tests with different numbers of steps in the G function (timing vs. resources tradeoff).
      • Play more with custom implementation strategies.

      Resources