Author: gelk

Direct communication between FPGA and GPU using Frame Based DMA (FDMA)

By Philipp Huber, Hans-Joachim Gelke, Matthias Rosenthal

GPUs, with their massive parallelism, are well suited for real-time video and signal processing. However, in a real-time system, a direct high-speed interface to signal sources such as cameras or sensors is often missing. Field programmable gate arrays (FPGAs) are ideal for this task: they can capture and preprocess multiple video streams or high-speed sensor data in real time.
Besides the partitioning of computational tasks between GPU and FPGA, the direct communication between GPU and FPGA is the key challenge in such a design. Since data communication is typically controlled by the CPU, it often becomes the bottleneck of the system.

This blog post presents a new method for efficient GPU-FPGA co-design called Frame Based DMA (FDMA), which builds on GPUDirect but does not involve the CPU in data transfers. This versatile solution can be used in a variety of applications that require hard real-time capabilities.

The Institute of Embedded Systems, an entity of Zurich University of Applied Sciences (ZHAW), developed the FDMA methodology for direct data transfers between the FPGA and the GPU. This IP has been compared with an implementation based on the Xilinx XDMA IP.

GPUDirect DMA in NVIDIA Devices
NVIDIA Quadro and Tesla GPUs support GPUDirect RDMA, which maps GPU RAM into the Linux I/O memory address space.
The CPU and other PCIe devices can then access the mapped memory directly. Using GPUDirect, the FPGA has direct access to the mapped GPU RAM.

Fig. 1: Direct transfer without CPU involvement

XDMA Implementation from Xilinx
This implementation is based on the XDMA IP from Xilinx. With this IP, the host can initiate any DMA transfer between the FPGA internal address space and the I/O memory address space, which allows direct transfers between the FPGA internal address space and the mapped GPU RAM. However, the host has to initiate each data transfer. As a result, an application must run on the host that listens for messages from the devices and starts the data transfers once the devices are ready.

Fig. 2: Xilinx XDMA IP based implementation

ZHAW FDMA Implementation
For this concept of direct FPGA-GPU communication, a special DMA-IP was developed at ZHAW InES. This DMA-IP is called Frame based DMA (FDMA) and is designed to work without any host interactions after system setup. This approach uses the AXI to PCIe Bridge IP from Xilinx to translate AXI transactions to PCIe transactions. FDMA supports multiple RX and TX buffers in the GPU. This allows using one buffer for reading or writing and the other buffers for GPU processing. Each GPU buffer has a flag in the GPU RAM. This flag indicates who has access to this buffer and is used for synchronization between the GPU and the FPGA.
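
The flag-based handoff described above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: the names `Owner`, `Buffer`, `fpga_write` and `gpu_process`, the use of two buffers and the round-robin order are hypothetical and chosen for illustration. The real FDMA logic runs in FPGA fabric and in GPU code, not in Python.

```python
from dataclasses import dataclass, field
from enum import Enum


class Owner(Enum):
    FPGA = 0  # the FPGA may write the buffer
    GPU = 1   # the GPU may read and process the buffer


@dataclass
class Buffer:
    flag: Owner = Owner.FPGA                  # ownership flag (kept in GPU RAM in the real design)
    data: list = field(default_factory=list)  # frame payload


def fpga_write(buf: Buffer, frame: list) -> bool:
    """FPGA side: write a frame only if the FPGA owns the buffer, then hand it over."""
    if buf.flag is not Owner.FPGA:
        return False              # buffer is still in use by the GPU
    buf.data = list(frame)
    buf.flag = Owner.GPU          # flip the flag only after the payload is written
    return True


def gpu_process(buf: Buffer):
    """GPU side: process a frame only if the GPU owns the buffer, then hand it back."""
    if buf.flag is not Owner.GPU:
        return None               # nothing new to process
    result = sum(buf.data)        # stand-in for real GPU processing
    buf.flag = Owner.FPGA
    return result


# Two buffers used alternately (hypothetical count, the text only says "multiple").
buffers = [Buffer(), Buffer()]
results = []
for n in range(4):
    buf = buffers[n % len(buffers)]
    assert fpga_write(buf, [n, n + 1])
    results.append(gpu_process(buf))
```

The point of the sketch is that neither side needs the host: each side polls the flag, does its work, and flips the flag to pass ownership to the other side.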

Fig. 3: InES frame based DMA implementation

Achieved Data rates of FDMA Implementation
For the following measurements, a Xilinx Kintex-7 FPGA with PCIe Gen2 x4 and an NVIDIA Quadro P2000 with PCIe Gen3 x16 were used. The slowest link between them is PCIe Gen2 x4 with a link speed of 16 Gbit/s. Figure 4 shows the average data rate of the two implementations for different transfer sizes. FDMA is faster for small transfers, because the host doesn't have to initiate every transfer. For larger block sizes the XDMA implementation is faster, because of performance issues in the Xilinx AXI to PCIe Bridge IP.
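
The 16 Gbit/s figure for PCIe Gen2 x4 follows from the per-lane transfer rate and the line encoding. A short calculation, as a sketch (the helper name `link_rate_gbit` is an assumption for illustration; the rates and encodings are the standard PCIe values):

```python
from fractions import Fraction

# (raw transfer rate in GT/s per lane, line-encoding efficiency) per PCIe generation
PCIE_GEN = {
    1: (Fraction(5, 2), Fraction(8, 10)),   # 2.5 GT/s, 8b/10b
    2: (Fraction(5), Fraction(8, 10)),      # 5 GT/s, 8b/10b
    3: (Fraction(8), Fraction(128, 130)),   # 8 GT/s, 128b/130b
}


def link_rate_gbit(gen: int, lanes: int) -> float:
    """Link rate in Gbit/s after line encoding, before packet overhead."""
    rate, efficiency = PCIE_GEN[gen]
    return float(rate * efficiency * lanes)


fpga_link = link_rate_gbit(2, 4)        # Kintex-7 side: Gen2 x4
gpu_link = link_rate_gbit(3, 16)        # Quadro P2000 side: Gen3 x16
bottleneck = min(fpga_link, gpu_link)   # the slower link limits the transfer
```

Note that this is the rate after line encoding only. Packet headers and flow control reduce the achievable payload rate further, so measured data rates stay below this value.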

Fig. 4: Average data rates comparing FDMA with XDMA

Resulting Transaction Jitter FDMA vs. XDMA
For real-time data processing, a low execution jitter is needed. The execution jitter of both implementations was determined by measuring the transfer rate of 10'000'000 data transfers of 32 bytes each. From these measurements, the three distributions shown in Figures 5 to 7 were calculated.

Fig. 5: XDMA transfer jitter
Fig. 6: FDMA transfer jitter with X11 server running

As these measurements reveal, the XDMA implementation has a large transaction jitter. This is because the Linux host has to initiate every single transfer, and Linux is not a real-time operating system. The two measurements of the FDMA implementation reveal that a small transaction jitter remains when the X11 server is running on the same GPU, but that it disappears almost completely when the X11 server is disabled, as shown in Figure 7.

Fig. 7: FDMA transfer jitter when X11 Server is disabled
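
A jitter distribution of the kind shown in Figures 5 to 7 can be derived from per-transfer timing values. The following is a minimal sketch, not the evaluation used for the figures: the function name `jitter_summary`, the 1 µs bin width and the sample values are assumptions, and the sample values are made-up placeholders rather than measurements.

```python
from collections import Counter
import statistics


def jitter_summary(durations_us, bin_width_us=1.0):
    """Summarize the spread of per-transfer durations given in microseconds."""
    # Histogram: map each duration to the index of its bin.
    hist = Counter(int(d // bin_width_us) for d in durations_us)
    return {
        "mean": statistics.fmean(durations_us),
        "stdev": statistics.pstdev(durations_us),
        "peak_to_peak": max(durations_us) - min(durations_us),
        "histogram": dict(sorted(hist.items())),
    }


# Made-up placeholder values, NOT measurements from this article.
sample = [2.0, 2.1, 2.0, 2.2, 9.5, 2.1]
summary = jitter_summary(sample)
```

A single late transfer, like the 9.5 µs outlier in the placeholder data, widens the peak-to-peak value considerably while barely moving the mean, which is why a distribution says more about real-time behavior than an average data rate.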

Conclusion
Both implementations, FDMA and XDMA, make use of direct transfers between the FPGA and the GPU and therefore reduce the load on the CPU. The FDMA developed at the Institute of Embedded Systems does not need any host interaction after setup, and as a result its transfer jitter is extremely low. This makes the FDMA implementation well suited for time-critical streaming applications.

For further information please contact hans.gelke@zhaw.ch or matthias.rosenthal@zhaw.ch

Multi-Channel I2S-Audio to MIPI-Camera Serial Interface (CSI) Converter FPGA-IP

The NVIDIA Tegra™ TX1/TX2 processors, with their powerful GPUs, are ideal for use in professional audio mixing consoles or audio/video equipment. However, when many audio channels are required, the TX1/TX2 is limited by its I2S audio inputs. Using the MIPI® Camera Serial Interface (CSI-2) together with the InES I2S to CSI-2 converter IP enables streaming of up to 256 digital audio channels into the TX1/TX2.

The Institute of Embedded Systems (InES) developed an FPGA IP that converts I2S audio to up to four CSI lanes for feeding audio into mobile processors such as the NVIDIA Tegra™ TX1/TX2.

A Linux driver, which links the received CSI signals to the Tegra™ TX1/TX2 processor buses, is also available. Hence, audio can be processed on the TX1/TX2 GPU or the internal audio blocks.

I2S sources can be audio codecs, SDI or HDMI chips. The CSI-2 protocol engine can be configured to generate CSI-2 data packets for one or four CSI lanes, depending on the required bandwidth. The CSI clock and data physical interfaces support differential (high-speed) and low-power CSI-2 signals.
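
To get a feeling for the bandwidth that decides between one and four lanes, the audio payload rate can be estimated. This is a rough sketch: the 48 kHz sample rate and 32 bits per sample are assumptions that are not stated for this IP, and the helper name `audio_payload_mbit` is hypothetical. CSI-2 packet overhead is ignored.

```python
def audio_payload_mbit(channels, sample_rate_hz=48_000, bits_per_sample=32):
    """Raw audio payload rate in Mbit/s, without any CSI-2 packet overhead."""
    return channels * sample_rate_hz * bits_per_sample / 1e6


# Assumed format: 48 kHz, 32 bits per sample (not specified in the text).
total = audio_payload_mbit(256)

# Payload share per lane when the stream is spread over one or four lanes.
per_lane = {lanes: total / lanes for lanes in (1, 4)}
```

Under these assumptions, 256 channels produce a payload of about 393 Mbit/s, or about 98 Mbit/s per lane when four lanes are used.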

The IP is written in VHDL and has been tested with Intel Cyclone-IV FPGAs. It can also be synthesized for Xilinx or Lattice FPGAs.

For more information contact Hans-Joachim Gelke (hans.gelke@zhaw.ch)

Block diagram of the I2S to CSI IP

Redundant 4k Video Streaming via Several LTE Connections

The InES HPMM research group presents a concept for mobile, redundant 4K video streaming over LTE networks. It combines the powerful 4K video capturing and processing capabilities of dedicated accelerators with the modularity and flexibility of an embedded high-performance SoC. The NVIDIA TX2 module is an ideal platform for this purpose.
Since the TX2 supports efficient HEVC encoding, one stream in 4K quality 1) or several streams in HD quality 2) can be transmitted over one LTE connection 3). Several LTE channels can be combined for redundant transmission via different LTE networks.
A video input mixer on the NVIDIA TX2 GPU allows scaling, overlay and side-by-side mixing of video sources.
HDMI is fed directly into the TX2 video path via an HDMI to CSI converter.

1) Main profile, up to 1 x 2160p60

2) 4 x 1080p60 or 8 x 1080p30

3) min. 5 Mbps are required for 2160p30
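
One simple way to picture redundant transmission over several LTE connections is to send every packet on every link and let the receiver keep the first copy of each sequence number. The sketch below illustrates that idea only. The actual scheme used by InES is not described here, and the function names, the link names and the drop patterns are hypothetical.

```python
def send_redundant(packets, links):
    """Duplicate every (seq, payload) packet onto every link; each link may drop some."""
    return [
        (seq, payload)
        for link in links
        for seq, payload in packets
        if seq not in link["drops"]
    ]


def receive_dedup(arrivals):
    """Keep the first copy of each sequence number and return payloads in order."""
    seen = {}
    for seq, payload in arrivals:
        seen.setdefault(seq, payload)   # later duplicates are ignored
    return [seen[seq] for seq in sorted(seen)]


packets = list(enumerate(["f0", "f1", "f2", "f3"]))
links = [
    {"name": "lte_a", "drops": {1}},       # hypothetical link losing packet 1
    {"name": "lte_b", "drops": {2, 3}},    # hypothetical link losing packets 2 and 3
]
received = receive_dedup(send_redundant(packets, links))
```

Although each link loses packets in this example, the stream arrives complete because no packet is lost on both links at once.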

Low Latency, Highly Reliable Wireless Video Transmission to iPad

The Institute of Embedded Systems, a research institute of Zurich University of Applied Sciences, developed a reference design for low-latency, highly reliable wireless video transmission from a battery-operated camera to an iPad or iPhone. The design is suitable for any application that requires a robust low-latency video link, such as vehicle remote control, industrial applications and automotive applications. Since the transmission is Wi-Fi based, no extra hardware is required to receive the video stream on an iPad or iPhone.
The camera module consists of an Intel SoC FPGA with an integrated single-core ARM-A9, a flexible interface to various types of cameras, and an SDIO interface to the Wi-Fi module. Optional LCD interfaces or an SD card slot allow monitoring and recording of the video at the camera module.


The low-latency video compression algorithm is nearly lossless and always transmits full frames. While the compression is implemented in the FPGA fabric, control is handled by a Linux operating system running on the ARM-A9.
Error correction avoids pixel and frame drops even when Wi-Fi transmission is problematic, such as in busy areas or in difficult topography. The Wi-Fi standard includes automatic retransmission of lost packets, but there remains a chance that packets are lost. To increase reliability further, we add redundant packets. This slightly increases the required bandwidth but does not add significant latency.
To receive the video stream, installing a viewer app is sufficient; no extra hardware is required. Video decompression and error correction are handled entirely by the GPU and the CPU of the iPad.
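
How redundant packets can repair a loss without retransmission is easiest to see with XOR parity, one of the simplest forward error correction schemes. The sketch below uses XOR parity as a stand-in: the redundancy scheme of the actual design is not described here, and the function names and the group of three packets are assumptions.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Parity packet: XOR of all equal-length data packets in the group."""
    return reduce(xor_bytes, packets)


def recover(received, parity):
    """Rebuild at most one missing packet (marked None) from the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        return received                   # nothing lost, or too many losses to repair
    present = [p for p in received if p is not None]
    restored = reduce(xor_bytes, present, parity)   # parity XOR all surviving packets
    return received[:missing[0]] + [restored] + received[missing[0] + 1:]


group = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(group)                          # one redundant packet per group
repaired = recover([b"AAAA", None, b"CCCC"], parity)
```

With one parity packet per group of three, the bandwidth grows by a third, and a single lost packet per group is rebuilt locally at the receiver without waiting for a retransmission.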
The FPGA IP requires only 2.9k logic cells, which is 18% of a 15k logic cell Intel Cyclone-V SoC.
The transmitter IP controls an 802.11n Wi-Fi module such as the Texas Instruments WL1835MOD; other TI modules are supported as well.
The measured glass-to-glass latency can be as low as 65 ms (2 video frames at 30 fps). Depending on the selected compression rate and the Wi-Fi channel quality, however, the latency may be higher.
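
The relation between the latency and the frame count can be checked with a one-line conversion (the helper name `frames_to_ms` is an assumption for illustration):

```python
def frames_to_ms(frames: float, fps: float) -> float:
    """Duration of a number of video frames in milliseconds."""
    return frames * 1000.0 / fps


latency_ms = frames_to_ms(2, 30)   # two frame periods at 30 fps
```

Two frame periods at 30 fps come to roughly 66.7 ms, which is in line with the measured value of about 65 ms.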
For more information, contact Tobias.Welti@zhaw.ch