Page 2 of 4

Secure Boot Concept for the Zynq Ultrascale+ MPSoC

The complexity of today’s multiprocessor System-on-Chip (MPSoC) can lead to major security risks in embedded designs, as the available security functions are often not or insufficiently utilized.

InES (Institute of Embedded Systems at ZHAW) developed a reference design which demonstrates a concept of a secure boot implementation and runtime system on a Xilinx Zynq Ultrascale+.

The security concept includes dedicated on-chip security features like AES, RSA and hashing core. The reference design also describes how to implement a voltage and temperature tamper detection. In addition, secure key storage and various methods to minimize key usage are provided. The demonstrator implements the ARM Trust Zone technology with OP-TEE as a secure operating system.

Implementation examples and usage description of the Linux Crypto-API, using the dedicated cryptographic cores, are also included in the documentation.

The modular open-source reference design is provided on GitHub, which contains implementation examples for all the above features.

Please find the link to our secure boot reference design here:

Linux Driver for TI DS90UB95x FPD-Link III serializer and deserializer

The Institute of Embedded Systems at ZHAW developed a driver for the deserializer DS90UB954 and serializer DS90UB953 from Texas Instruments. The driver was tested on the RaspberryPi 4, NVIDIA Nano and NVIDIA Xavier modules.

FPD-Link III is a cost-effective solution for high speed video transmission. The video data, the bidirectional configuration signal and the power supply are all transmitted over a single coaxial cable. At the length of up to 15 m the coaxial cable can support data rates of 6 Gbps.

In order to use our FDP-Link III driver (link driver) on different hardware, the driver was designed to be highly configurable. Additionally, the driver can be used with various FPD-Link III cameras. Instead of integrating the FPD-Link III part into already existing camera sensor drivers, the link driver is standalone and creates a transparent CSI and I2C link to the data source. This means that after the driver has set up the FPD-Link III connection, the sensor driver can be used without modifications.

Transparent Link

The following figure shows an example of a sensor driver which controls a camera sensor directly over I2C and configures the video interface

Camera sensor pipeline

Adding an FDP-Link III connection means that the I2C interface and the video channel will be routed through the coaxial cable. The deserializer and serializer are responsible for the conversion and forwarding of I2C and video data. The following figure shows the link driver configuring the deserializer and serializer over I2C. Once the setup is done, the sensor driver can configure the camera sensor over the I2C interface. Since the link is transparent, no changes have to be made to the sensor driver.

Camera sensor pipeline with FPD-Link III

Configurability of the Driver

The following configurations can be done in the device tree:
– I2C address of deserializer/serializer
– Number of MIPI CSI lanes (the camera sensor and the hardware do not need to have the same number of lanes)
– MIPI CSI lane speed
– Enable/disable continuous clock
– Enable/disable test pattern of deserializer/serializer
– Virtual channel ID mapping
– Configure GPIOs of deserializer/serializer
– Set I2C alias addresses

Follow this link for the Driver source code and documentation A detailed description of device tree configurations can be found in ds90ub95.txt.

A hardware reference design for the Raspberry Pi can be found in our blog: Power over Coax FPD-Link III Link Streaming Adapter for Raspberry PI CSI-Interface

Power over Coax FPD-Link III Link Streaming Adapter for Raspberry PI CSI-Interface

The Institute of Embedded Systems at ZHAW has developed an open source adapter which allows streaming of a CSI-2 Camera interface to a Raspberry Pi. This allows connecting cameras with CSI interface via a long distance cable (up to 15m) to the CSI-2 input of a Raspberry Pi.

The long range adapter uses FPD-Link III high-speed video transmission technology by utilizing the existing MIPI CSI-2 interfaces of cameras and Raspberry Pi. A deserializer converts the FPD-Link III signal to CSI, is based on the DS90UB954 from Texas Instruments (TI). The counterpart located at the Raspberry Pi camera is based on the serializer DS90UB953 from TI. With these two components it is possible to transmit high-speed video data over a single coaxial cable which can be up to 15 meters long. Another advantage of FPD-Link III is the power over coax (PoC) capability, which transmits the power required for the camera sensor directly from the Raspberry Pi.
The schematics and PCB designs are open source and available here.
A driver for the Texas Instruments DS90UB95x serializer and deserializer can be found in our blog Linux Driver for TI DS90UB95x FPD-Link III serializer and deserializer

Deep Learning for Classifying Food Waste

Amin Mazloumian
Hans-Joachim Gelke
Matthias Rosenthal

Institute of Embedded Systems Zurich University of Applied Sciences Zurich, Switzerland

One third of food produced in the world for human consumption – approximately 1.3 billion tons – is lost or wasted every year. By classifying food waste of individual consumers and raising awareness of the measures, avoidable food waste can be significantly reduced. In this research, we use deep learning to classify food waste in half a million images captured by cameras installed on top of food waste bins. We specifically designed a deep neural network that classifies food waste for every time food waste is thrown in the waste bins. Our method presents how deep learning networks can be tailored to best learn from available training data.

In this paper, a more informative view to food waste production behavior at the consumption stage is achieved through classifying food waste in waste bins. The classification task is feasible by processing images captured from food waste in the waste bins. The images are captured by installing cameras on top of the waste bins and monitoring the top surfaces of food waste in the bins. This study focuses on classifying food waste in half a million images captured by cameras installed on top of waste bins. The system design of a smart garbage systems that uses our classification is out of the scope of this study.

The automatic classification of food waste in waste bins is technically a difficult computer vision task for the following reasons.
a) It is visually hard to differentiate between edible and not-edible food waste. As an example consider distinguishing between eggs and empty eggshells.

b) Same food classes come in a wide variety of textures and colors if cooked or processed. c) Liquid food waste, e.g. soups and stews, and soft food waste, e.g. chopped vegetables and salads, can largely hide and cover visual features of other food classes.

In this research, we adopt a deep convolutional neural network approach for classifying food waste in waste bins. Deep convolutional neural networks are supervised machine learning algorithms that are able to perform complicated tasks on images, videos, sound, text, etc. The deep neural networks are composed of tens of convolutional layers (deep) that train on labelled data (supervised training) to learn target tasks. Labelled training data is composed of thousands of input- output pairs. In the training phase, the networks learn to produce the expected training output (labels) given the training input data. The training is performed by calculating millions of parameter values for feature extraction convolutional filters. In image processing, first layers of trained deep convolutional networks detect simple features, e.g. edges and corners. Based on the low level features extracted in first layers, deeper layers detect higher level features such as contours and shapes.

For more information please read our paper:
Deep Learning for Classifying Food Waste

Artificial Intelligence on Microcontrollers

Using artificial intelligence algorithms, specifically neural networks on microcontrollers offers several possibilities but reveals challenges: limited memory, low computing power and no operating system. In addition, an efficient workflow to port neural networks algorithms to microcontrollers is required. Currently, several frameworks that can be used to port neural networks to microcontrollers are available. We evaluated and compared four of them:

The frameworks differ considerably in terms of workflow, features and performance. Depending on the application, one has to select the best suited framework. On our github page we offer guides and example applications which can help you to get started with those frameworks!

The neural networks that are generated with all those frameworks are static. This means that once they are integrated into the firmware they cant be changed anymore. However, it would be beneficial if the neural network running on the microcontroller could adapt itself to a changing domain. We developed an algorithm (emb-adta) which could be used for unsupervised domain adaptation on microcontrollers. The prototype python implementation is also available on github!

ZNNN the Framework to Port Neural Networks to FPGA


Due to their hardware architecture, Field Programmable Gate Arrays (FPGAs) are optimally suited for the execution of machine learning algorithms. These algorithms require the calculation of millions or even billions of multiplications for each input. To successfully accelerate a neural network, parallel execution of multiplication is the key. The obvious suggestion for parallel execution is a Graphics Processing Unit (GPU), offering hundreds of execution cores. For years, GPU vendors have been adapting the capabilities of their GPUs to meet the demand for narrow integer and floating-point data types used in AI. But still, a GPU will execute one Neural Network (NN) layer after the other, with data transfers between computation cores and memory.

Implementing Neural Networks in FPGAs has several advantages:

  1. Flexible bit widths for both integer and fixed-point data types.
  2. Large numbers of scalable hardware multiplier cores.
  3. Flexibility due to tightly coupled memory blocks with wide parallel interfaces, allowing access to vast numbers of data points in each clock cycle.

Considering the previous points, the FPGA clearly provides all the resources required for highly parallel execution of NN algorithms.

Existing frameworks

Unfortunately, the act of porting a trained network to HDL code for implementation in the FPGA is not trivial. FPGA vendors have started to provide frameworks for running NNs in their devices. These include HDL-coded NN-coprocessor cores as IP blocks and matching compilers to convert a trained NN into a binary executable which will run on the coprocessor. However, these frameworks are based on a specific software library and therefore require a processor core running an operating system and controlling software. This means that the NN input data and network parameters are transferred from the software to the coprocessor in order to calculate the output of the NN. The output values are then transferred back to the software for interpretation.

This is substantial overhead, especially if the input data is sampled or preprocessed in the FPGA fabric. It would be preferable to implement the neural network entirely in the FPGA fabric, capable of running independently from software.

ZHAW Native Neural Network

The ZHAW Native Neural Network (ZNNN) framework is aimed at the following goals:

  • Input may be received directly from FPGA fabric
  • Inference independent of CPU and software
  • Minimal latency
  • Maximal throughput
  • No access to DRAM required

With these goals in mind, it is obvious that we trade in flexibility to gain performance and simplicity. The NN is implemented as a rigid block, designed for one single NN application. To allow for minimum latency, we use dedicated multipliers for each neuron, and each layer has its own memory block for the weights and biases. Ping-Pong buffers allow to process one input vector in one layer while receiving the next input vector. With this structure, pipelining delays can be minimized to the execution time of the largest layer.

Our framework will take as input a structured text file with a description of the NN, including number of inputs, data-bit widths, fix point precision, number of neurons per layer for fully connected layers, number of filters and kernel size for convolutional layers, max pool and flatting layers. From this configuration file and a training and verification data set, it will generate:

  • Input A trained NN model
  • A behavioural model written in C programming language to generate a data set for verification of the VHDL code in simulation
  • A test bench for verification
  • The VHDL code of the NN ready for instantiation in your design.

Dedicated multipliers for the neurons will use a significant amount of the available resources and it must be noted that larger networks will require considerably larger devices. This will not be suitable for all NN applications. Our ZNNN framework is optimally suited for applications such as industrial machine surveillance where only small networks will meet the latency requirements while still achieving the required accuracy.


A direct comparison of ZNNN with the Deep Learning Processing Unit (DPU) coprocessor from Xilinx shows that both have their justification, depending on the application at hand:

If you need to run multiple, different neural networks on your FPGA with a fair performance, you should go with the Xilinx solution. The DPU allows to process different NN on the same implementation but is restricted to software-controlled operation.

If performance is essential and your application needs a single neural network, you should use the ZNNN.

The amount of resources in a Xilinx Zynq UltraScale+ EG9 device used by the different solutions is shown in the following table. The ‘Xilinx DPU’ will always use roughly the same amount of resources (depending on its configuration). It can process various neural networks, including very large ones, with a trade-off in throughput and processing time (latency). The resource requirements of ZNNN strongly depend on the size and type of NN you implement. ‘ZNNN MNIST’ is a NN with only dense layers, trained for the well-known MNIST example. MNIST is a NN application that recognizes handwritten numbers. ‘ZNNN CONV’ is a NN using 1D-convolutional layers for non-linear signal processing in an industrial application which accepts 64 data points as input. ‘ZNNN VIS’ is a dense network with 2304 inputs and one single output, used for an industrial application. According to the large number of inputs, the number of multipliers required is very large.

NN LUT BRAM DSP Throughput (FPS)
Xilinx DPU 47 k (17%) 132 (14%) 326 (13%) 2.5 k
ZNNN MNIST 34 k (12%) 182 (20%) 947 (37%) 8510 k
ZNNN CONV 20 k (7%) 124 (13%) 712 (28%) 291 k
ZNNN VIS 87 k (32%) 182 (20%) 2467 (98%) 4081 k

Throughput of a NN can be measured in the number of inputs processed per second (FPS). On the Xilinx DPU, the whole NN is processed for one set of input data before the next set can be passed. Our ZNNN framework implements layer pipelining, meaning that as soon as the first layer is processed, the next input set can be accepted as shown in the following figure.

The latency is slightly increased because not all layers have the same processing time, but all the layers are processed in parallel. In return, the delay between two inputs is greatly reduced, allowing to process more FPS. Because ZNNN includes all the required weight parameters in the design, these don’t need to be loaded into the FPGA at runtime. This allows to increase the FPS by orders of magnitude in comparison with the Xilinx DPU.


Both the power and the cost of ZNNNs become visible in comparison with the DPU: The DPU offers the flexibility to run various NNs on one implementation, including larger NNs like Resnet50. The DPU is controlled by software and therefore requires a CPU running a Linux operating system. ZNNN implementations are ideal for small NNs and run independently from software, take their input directly from FPGA and process orders of magnitude faster than the DPU!

The ZNNN framework is suitable for low latency, high throughput execution of small convolutional and fully connected NNs. It generates VHDL code for a specific NN implementation in FGPA without the development overhead of hand-written HDL code and testbenches. The processing performance of the ZNNN is orders of magnitude faster than Xilinx’ DPU thanks to a high level of pipelining.

We are aware that the ZNNN implementation can require more FPGA resources than the DPU, but there are industrial applications where this approach is a perfect fit and the achieved performance meets the requirements. With the ZNNN running independently of CPU and software and the input data coming directly from the FPGA fabric, we have principally no bottlenecks in the design.

Our team will continually improve the ZNNN framework by making trade-offs between resource requirements and performance configurable.

Interfacing FPD-Link III to a x86-PC via PCI Express

A computer with a GPU combined with an FPGA is a powerful tool for high speed video processing. An FPGA can preprocess multiple video streams in realtime and then send the data to the GPU for further processing.

FPD-Link III is a cost-effective solution for high speed video transmission. It has made a name for itself for its widespread use in the automotive industry. The transmission can be done over a simple coaxial cable but includes not just a video data stream, but also a bidirectional configuration channel and a power supply for the camera.

The purpose of this project is to design hardware which makes it possible to take full advantage of the developed FPGA-GPU co design [1] and to combine it with an FPD-Link III interface. The resulting baseboard utilizes PCI Express implemented in an FPGA which allows connecting up to 6 FPD-Link III. The FPGA is embedded in a system-on-chip and could potentially also be used stand-alone. As a further video source option, it also includes two connectors for MIPI CSI cameras. These are designed to be compatible with RaspberryPi cameras.

[1] See our blog “Direct communication between FPGA and GPU using Frame Based DMA (FDMA)” for solving the bottleneck between CPU and GPU


In a standard computer, PCI Express (PCIe) offers the possibility for two devices to exchange data on up to 16 high speed data lanes. The CPU is master of the PCIe interface and therefore usually initiates data transfers. This causes overhead and limits the maximum transfer speed for certain applications. The paper FPGA-GPU Codesign from xxx implements solutions to transfer data directly from an FPGA over PCIe to a GPU without the CPU being a bottleneck for the data througput. This approach can be especially useful for applications with high resolution video streams which need to be processed in real time. Live video streams from multiple cameras can be preprocessed in the FPGA and then be transmitted via PCI Express to the GPU for further processing. This thesis is about implementing hardware suitable to take full advantage of this idea.

The goal is to be able to connect multiple cameras to a computer via a PCIe baseboard with an FPGA. The camera interface chosen for this baseboard is FPD-Link III.


FPD-Link III is a cost-effective solution for high speed video transmission. It has gained relevance in automotive applications. Cameras in cars are becoming more common. Nowadays even low cost cars come with a rear camera to assist while parking. FPD-Link III can also be used in different industrial applications, especially for real time use-cases which require high-bandwidth transmissions.

The goal of this project is to develop a Baseboard that makes it possible to connect multiple FPD-Link III cameras to a standard computer with high data rates. The video data from the cameras should be preprocessed in the FPGA and then be forwarded to the computer via PCI Express. Cameras also require to be configured. This configuration will be handled in the FPGA over FPD-Link III.

The Basics of FPD-Link III

Flat panel display link III (FPD-Link III) can be used to receive data from a camera or to send data to a display. The well known standards for high speed video transmission on the consumer market are HDMI, DisplayPort and USB. However, these cables are expensive and better suited for short distances. FPD-Link III can be used with coaxial or a shielded twisted-pair (STP) cables. A 15 m coaxial cable supports data rates up to 6 Gbps and a 10 m long STQ (shielded twisted quad) cable supports up to about 5 Gbps. But FPD-Link III does not only transmit video data. In addition to the video channel, there is also a bidirectional control channel. This is needed for a processor to configure the camera sensor. In case of a display, the control channel can for example be used to send commands from the touchscreen to the processor. The video channel occupies the frequency range between 70MHz and 700MHz whereas the control channel lies between 1MHz and 5MHz.
An additional feature of FPD-Link III is Power over coax which offers the possibility to power the image sensor over the coax cable. This eliminates the need for a further cable for power supply.

FPD-Link III to PCIe Video Pipeline

As shown in the figure below, the video is transmitted from a camera sensor to a serializer which sends the video over FPD-Link III in a coaxial cable to the deserializer. The deserializer transmits the data to the FPGA over MIPI CSI-2 D-PHY. In the FPGA, the data can be preprocessed and then be sent over PCIe to the memory on the computer. There will be multiple deserializers on the baseboard so multiple cameras can be connected.

The concept of the Baseboard

The following figure shows a sketch of the concept for the baseboard. The path of the video data is colored in bright green. On the left side there are six coax connectors for the FPD-Link III interface (number of coax connectors is limited to six because of the defined maximum width of a PCIe card). The data goes through the deserializer to the FPGA. From the FPGA the data is transmitted to a computer over PCIe. In addition to the FPD-Link III connectors there are two MIPI CSI-2 D-PHY connectors which are compatable with the RaspberryPi-Camera. This gives the user an additional option for a video source.

The main focus of the project is the deserializer for FPD-Link III, the integration of a SoC FPGA and the PCIe interface. The company Enclustra offers a selection of SoC modules which entail a Xilinx MPSoC (Multiprocessor system-on-chip), SDRAM, flash memory and more. The module can be mounted to the baseboard via connectors. The figure above shows additional hardware and interfaces which are needed for an operating baseboard. The SoC needs a JTAG interface for programming and debugging. An SD-card slot is added and can be used as the boot device. The boot mode switch can be used to change the source device for the boot process. The SoC can be reset over a button and a status LED gives further information about the state of the SoC. The UART (Universal Asynchronous Receiver Transmitter) is needed for the console output of the processor. An ethernet connector makes it possible to get access to the processor with SSH (Secure Shell).

A power switch makes it possible to choose between an external power supply or the 12V supplied over the PCIe interface from the computer. The external supply is needed when the SoC should be programmed before the computer is booted. The connection to PCIe devices are established while booting in the bios, which means the FPGA should be programmed before the computer turns on. The external supply is also useful when the board consumes more power than the computer can supply.

PCIe can be used in four different lane configurations: 1-, 4-, 8- or 16-lanes. The PCIe switch is used to choose between these options.

FPD-Link III DeSerializer

The deserializer converts the FPD-Link III signal to MIPI CSI-2 which can be connected to the FPGA. Texas Instruments (TI) offers a variety of solutions for FPD-Link III serializers and deserializers. The requirements for choosing a deserializer are the following:

• Input: FPD-Link III LVDS
• Output: MIPI CSI-2
• Able to connect to 2+MP (mega pixel) cameras

TI provides two deserializers which meet the given requirements. These are DS90UB960-Q1 (960) and DS90UB954-Q1 (954).
The 960 and 954 models have the same maximum data rates for FPD-Link III and MIPI CSI-2. However, 954 has more GPIO pins per camera. 954 has 7 GPIOs for 2 cameras and 960 has 8 GPIOs for 4 cameras. GPIO signals are useful to get diagnostics of the deserializer, but they can also be used to connect directly to the camera sensor board. Some cameras need for example an enable signal which can be set by the processor over these GPIOs. The deserializer chosen for this baseboard is the DS90UB954-Q1.
Below picture shows the available serializers and deserializers from TI

Implementation of the Deserializer

The DS90UB954-Q1 deserializer can be used with one or two camera sensors over FPD-Link III. It supports 2MP@60fps and 4MP@40fps cameras. The two input channels RIN0/RIN1 can be enabled and disabled through registers of the deserializer (Register: RX_PORT_CTL, 0x0C). The input channels can be used as single ended (coaxial channel) or as differential (STP). This baseboard uses single ended coaxial connections. This means the RN- port of a channel is connected to ground with a 15nF-capacitor and a 50 ohm resistor. The RN+ port is connected to the conductor inside the coax connector with a 33nF capacitor in series. This capacitor blocks the receiver ports of the deserializer from any DC voltage. This is especially important if power over coax is used.
Power over coax (PoC) makes it possible to use the coaxial cable to supply power to the camera sensor. The output of a power supply is connected to the coaxial connector with a filter in series. This filter is needed to shield the power supply from the AC signal transmitted between deserializer and camera sensor over FPD-Link III.

Typically, the PoC voltage is between 5V and 36V baseboard. Before connecting a camera to the baseboard it must always be checked what PoC voltage is tolerated by the camera board. If needed, the PoC voltage can be cut off from the coax connector design provided from TI.

The FPD-Link III PCIe baseboard was successfully designed and produced. Parts of the baseboard are tested and verify that the baseboard as such is functional. The components for the MPSoC are working properly and the baseboard was successfully detected over PCIe. Some tests showed that there is still work to do in the bring-up of the baseboard.

A task that is still open is the debugging of the PCIe to get a working link with lane width of 8 and 16. The link to a PCIe device is established during the boot of the BIOS (basic input/output system) which is demanding to debug.


Once all these individual parts are completed, they can to be combined into one single system that collects video streams from multiple cameras and transfers the data to a computer via PCIe.

Direct communication between FPGA and GPU using Frame Based DMA (FDMA)

By Philipp Huber, Hans-Joachim Gelke, Matthias Rosenthal

GPUs with their immense parallelization are best fitted for real-time video and signal processing. However, in a real-time system, the direct high-speed interface to the signal sources, such as cameras or sensors, is often missing. For this task, field programmable gate arrays (FPGA) are ideal for capturing and preprocessing multiple video streams or high speed sensor data in real time.
Besides the partitioning of computational tasks between GPU and FPGA the direct communication between GPU and FPGA is the key challenge in such a design. However, since the Data communication is typically controlled by the CPU, this often becomes the bottleneck of the system

This blog shows a new method for an efficient GPU-FPGA co-design called Frame based DMA (FDMA) which is based on GPUDirect, but without using the CPU for data transfer. This versatile solution can be used for a variety of different applications, where hard real-time capabilities are required.

The Institute of Embedded Systems, an entity of Zurich University of Applied Sciences (ZHAW), developed the FDMA methodology for direct data transfers between the FPGA and the GPU. This IP has been compared with an implementation based on the Xilinx XDMA IP.

GPUDirect DMA in NVIDIA Devices
Nvidia Quadro and Tesla GPUs support GPUDirect RDMA mapping of GPU RAM to the Linux IO-memory address space.
The CPU and other PCIe devices can access the mapped memory directly. Using GPUDirect the FPGA has direct access to the mapped GPU RAM.

Fig. 1: Direct transfer without CPU involvement

XDMA Implementation from Xilinx
This implementation is based on the XDMA IP from Xilinx. With this IP the host can initialize any DMA transfer between the FPGA internal address space and the I/O-memory address space. This allows direct transfers between the FPGA internal address space and the mapped GPU RAM. However, the host has to initialize each data transfer. As a result, an application has to run on the host, which is listening to messages from the devices. This application starts the data transfers if the devices are ready.

Fig. 2: Xilinx XDMA IP based implementation

ZHAW FDMA Implementation
For this concept of direct FPGA-GPU communication, a special DMA-IP was developed at ZHAW InES. This DMA-IP is called Frame based DMA (FDMA) and is designed to work without any host interactions after system setup. This approach uses the AXI to PCIe Bridge IP from Xilinx to translate AXI transactions to PCIe transactions. FDMA supports multiple RX and TX buffers in the GPU. This allows using one buffer for reading or writing and the other buffers for GPU processing. Each GPU buffer has a flag in the GPU RAM. This flag indicates who has access to this buffer and is used for synchronization between the GPU and the FPGA.

Fig. 3: InES frame based DMA implementation

Achieved Data rates of FDMA Implementation
For the following measurements, a Xilinx Kintex 7 FPGA with PCIe Gen2x4 and an Nvidia Quadro P2000 PCIe Gen3x16 have been used.
The data rates with the two implementations have been measured with the Xilinx Kintex-7 FPGA and the Nvidia Quadro P2000. The slowest link between them is PCIe Gen2x4 with a link speed of 16Gbit/s. The Figure 4 shows the average data rate for different transfer sizes. FDMA is faster for small transfers, because the host doesn’t have to initialize every transfer. For larger block sizes the XDMA implementation is faster, because of performance issues in the Xilinx AXI to PCIe Bridge IP.

Fig. 4: Average data rates comparing FDMA with XDMA

Resulting Transaction Jitter FDMA vs. XDMA
For real time data processing a low execution jitter is needed. This execution jitter was measured with both implementations by measuring the transfer rate of 10’000’000 data transfers of 32 bytes. Based on these measurements, the three distributions shown in Figure 5 to 7 have been calculated

Fig. 5: XDMA transfer jitter
Fig. 6: FDMA transfer jitter with FDMA and X11

As these measurements reveal, the XDMA implementation has a huge transaction jitter. This is the case because the Linux host has to initialize every single transfer and Linux is not a real time operating system. The two measurements of the FDMA implementation reveal that there is still a small transaction jitter when the X11 server is running on the same GPU but it disappears nearly completely when disabling the X11 server, as shown in the drawing below.

Fig. 7: FDMA transfer jitter when X11 Server is disabled

Both implementations, FDMA and XDMA make use of the direct transfers between the FPGA and the GPU and therefore reduce the load on the CPU. The FDMA developed at Institute of Embedded Systems does not need any host interaction after setup and such transfer jitter is extremely low. This makes the FDMA implementation perfect for time critical streaming-applications.

For further information please contact or

IntEdgPerf: A new AI benchmark for embedded processors

IntEdgPerf is a new benchmark for running machine learning algorithms on embedded devices. It was developed at the Institute of Embedded Systems (InES) at the Zürich University of Applied Sciences. IntEdgPerf is a framework that allows a fair comparison between different embedded processors that can be used for executing neural networks.

The area of embedded AI is a quickly emerging market where many hardware manufacturers provide accelerators and platforms. So far, metrics and benchmarks provided by the manufacturers are not usable for comparison.

IntEdgPerf incorporates a collection of multiple TensorFlow AI models. It measures the time for the computations of the machine learning algorithm on embedded devices. The benchmark is dynamically extendable by allowing new machine learning models to be integrated. Hardware specific calls can be implemented as modules and integrated in the benchmark.

The benchmark was verified on multiple processors and machine learning accelerators such as Nvidia Quadro K620 GPU, Nvidia Jetson TX1 & TX2 and an Intel Xeon E3-1270V5. Also hardware accelerators, without a direct interface to TensorFlow, such as the Intel Movidius Neural Compute Stick were benchmarked. All tests used unoptimized networks and systems.

The following convolutional models were used in the test:

  • CNN3: This is a fully convolutional neural network, using only convolutional layers. Instead of using maxpool-layers a stride size of two is used to decrease size (the stride size defines the pixel shift of the convolution filter).
  • CNN3Maxpool: The effect of maxpool on the performance is shown by comparing the previously described network to the same version, utilizing maxpool instead of stride sizes higher than one.

  • CNN2FC1 and CNN2MaxpoolFC1: A third and fourth comparison can be made when replacing the last layer of the two previous networks with a fully connected layer. This allows more flexibility for the network for the input size of the image.

For more information, please visit the benchmark’s website at

Machine learning on Cortex-M4 using Keras and ARM-CMSIS-NN

We have developed a simple software  to show how a custom keras model can be automatically translated into c-code. The generated c-code can, in combination with the ARM-CMSIS-NN functions, be used to run neural-net calculations in an efficient way on an embedded micro-controller such as the CORTEX-M4. 

The example software on GitHub has also firmware which runs on the STM32F4-Discorevy BoardPart of the firmware was generated with cubeMX.

The example software has a MNIST classifier which can classify handwritten digits.

See for more Details.

« Older posts Newer posts »