Author: rosn (Page 1 of 2)

Train Analog Devices MAX78002 directly from Jupyter Notebook

For the past several months, we have worked intensively with the ai8x-training tool, training various Convolutional Neural Network (CNN) architectures tailored to the MAX78000 and MAX78002 devices. In our experience, however, the ai8x-training tool was more of a hindrance than a help.

Among the myriad challenges we encountered, the inability to make real-time adjustments, fine-tune models while freezing or unfreezing specific layers, and transfer custom weights proved to be significant pain points. These restrictions prevented efficient and flexible development.

But we found a way to train models for the MAX78000 and MAX78002 devices right from your Jupyter notebook. With this approach, you can debug in real time and fine-tune freely, interacting with your neural networks directly. Let’s dive into the nitty-gritty of this process and unlock the full potential of these devices.

This guide will walk you through training MAX7800x directly from a Jupyter notebook. The best part? You won’t need the ai8x-training tool for this process.

The provided example includes a simple classifier for the Fashion MNIST dataset.

Training for only 7 epochs plus 1 quantization-aware training (QAT) epoch leads to an accuracy of 99% on the test set! The resulting model is directly ready for deployment on the MAX78000/MAX78002.
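To give a feel for what the extra QAT epoch does, here is a minimal, framework-free sketch of the "fake quantization" step that quantization-aware training typically applies to weights in the forward pass. It is not the ai8x implementation: the helper name `fake_quantize`, the symmetric per-tensor scaling, and the 8-bit default are assumptions made purely for illustration.

```python
def fake_quantize(weights, bits=8):
    """Round weights to a signed `bits`-bit grid and map them back to floats.

    Hypothetical helper: symmetric per-tensor scaling is an assumption,
    not necessarily what the MAX7800x toolchain does.
    """
    qmax = 2 ** (bits - 1) - 1      # 127 for 8 bits
    qmin = -(2 ** (bits - 1))       # -128 for 8 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)        # nothing to scale
    scale = max_abs / qmax          # float value of one integer step
    quantized = []
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))  # integer code
        quantized.append(q * scale)                 # back to float
    return quantized


weights = [0.30, -0.70, 1.00, 0.0]
print(fake_quantize(weights))
```

During QAT, the network sees these rounded values while it trains, so it can adapt to the precision it will have on the device.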


Nvidia Jetson Orin NX Modular Vision System

The Institute of Embedded Systems (InES) at ZHAW, which has extensive experience in hardware and software development for NVIDIA Jetson computing modules, presents the successor to the NVIDIA Jetson AGX Xavier modular vision system, based on the new high-performance NVIDIA Jetson Orin NX.

Like the Jetson AGX Xavier system, this new prototyping platform consists of a greatly reduced baseboard which can be extended with different types of M.2-footprint modules to add functions like HDMI in, FPD-Link III, USB-C, etc. Thanks to the modular architecture, a personalized system can be configured. The low complexity and flexibility of the provided interfaces also make it possible to develop further custom, application-specific extension boards.

Figure 1: Anyvision Orin NX Carrier Board

The system consists of a minimal motherboard which includes the necessary circuitry to start the Orin NX and program it to an external NVMe SSD.

The baseboard is powered with a common 12V DC power supply. To keep size and costs low, the available peripheral interfaces on the mainboard are limited to 2x USB-A 3.2 Gen 2, 1x Micro-USB for flashing, 1x HDMI out, and 1x Gigabit Ethernet.

Figure 2: Interfaces and Features of Orin NX Motherboard (example configuration with a Dual HDMI module and a CAN module)

All additional interfaces are exposed by M.2-footprint sockets. The Networking interface, SSD interface, and PCIe x4 slot meet industry standards. The Video Input and General-Purpose interfaces, on the other hand, implement a custom pinout defined by ZHAW InES and share dedicated Orin NX interfaces like MIPI CSI-2 (2x 4-lane), USB, I2C, SPI, and UART, as well as GPIO functionalities.

In comparison to the original Jetson AGX Xavier Anyvision baseboard, only one Video Input Interface (instead of two) is available. Furthermore, the Ethernet interface and the Video Output interface were omitted; on the new system, Ethernet is accessible directly through the Ethernet port on the baseboard.

An overview of possible configurations and currently available extension modules is given in the table below (for more information about the extension modules see Nvidia Xavier-AGX Modular Vision System).

| Interface Name | Orin Dedicated Lines | Available Modules |
|---|---|---|
| Video Input Interface (M.2 M) | 2x 4-lane (or 4x 2-lane) CSI-2, 1x I2C, 1x I2S, 6x GPIO | Dual HDMI (4k30) Module, FPD-Link III Module, Dual RPI Camera Module |
| General Purpose Interface (M.2 E) | 2x USB 3.2 Gen2, 1x I2C, 1x CAN, 1x SPI, 1x UART, 6x GPIO | Dual USB-C Module, CAN Module (3x CAN) |

Table 1: Interfaces for custom Anyvision modules

| Slot | PCIe | Available Modules |
|---|---|---|
| M.2 M (WWAN, SSD) | 2x PCIe | occupied by NVMe SSD |
| M.2 E (WiFi, BT, NFC) | 1x PCIe | generic WiFi or BT modules |
| PCIe Slot | 4x PCIe | generic PCIe cards |

Table 2: Available industry-compliant PCIe interfaces

In the future, the plan is to expand the selection of available extension modules, for example with an FPD-Link IV or an HDMI 4k60 module.

For more information about the baseboard or the extension modules, feel free to contact us!

Artificial Intelligence on Microcontrollers

Using artificial intelligence algorithms, specifically neural networks, on microcontrollers offers several possibilities but also poses challenges: limited memory, low computing power, and no operating system. In addition, an efficient workflow for porting neural network algorithms to microcontrollers is required. Currently, several frameworks that can be used to port neural networks to microcontrollers are available. We evaluated and compared four of them.

The frameworks differ considerably in terms of workflow, features, and performance. Depending on the application, one has to select the best-suited framework. On our GitHub page we offer guides and example applications which can help you get started with those frameworks!

The neural networks generated with all those frameworks are static. This means that once they are integrated into the firmware, they cannot be changed anymore. However, it would be beneficial if the neural network running on the microcontroller could adapt itself to a changing domain. We developed an algorithm (emb-adta) which could be used for unsupervised domain adaptation on microcontrollers. The prototype Python implementation is also available on GitHub!

IntEdgPerf: A new AI benchmark for embedded processors

IntEdgPerf is a new benchmark for running machine learning algorithms on embedded devices. It was developed at the Institute of Embedded Systems (InES) at the Zürich University of Applied Sciences. IntEdgPerf is a framework that allows a fair comparison between different embedded processors that can be used for executing neural networks.

The area of embedded AI is a quickly emerging market where many hardware manufacturers provide accelerators and platforms. So far, metrics and benchmarks provided by the manufacturers are not usable for comparison.

IntEdgPerf incorporates a collection of multiple TensorFlow AI models. It measures the time the machine learning computations take on embedded devices. The benchmark is dynamically extendable: new machine learning models can be integrated, and hardware-specific calls can be implemented as modules and integrated into the benchmark.
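The idea of timing pluggable models can be sketched in a few lines of Python. This is not IntEdgPerf's actual code: the registry `MODELS`, the decorator `register_model`, and the stand-in workload are all hypothetical names chosen for illustration, and a real module would call TensorFlow or the target hardware instead.

```python
import time

MODELS = {}  # hypothetical registry: model name -> callable running one inference


def register_model(name):
    """Decorator that adds a model callable to the registry."""
    def decorator(fn):
        MODELS[name] = fn
        return fn
    return decorator


@register_model("dummy_dot_product")
def dummy_dot_product():
    # Stand-in workload; a real module would invoke the target hardware here.
    a = list(range(1000))
    return sum(x * x for x in a)


def run_benchmark(repeats=5):
    """Return the best wall-clock time in seconds for every registered model."""
    results = {}
    for name, fn in MODELS.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            timings.append(time.perf_counter() - start)
        results[name] = min(timings)
    return results


print(run_benchmark())
```

Taking the minimum over several repeats is one common way to reduce the influence of background load; reporting the mean or median would be equally valid design choices.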

The benchmark was verified on multiple processors and machine learning accelerators, such as the Nvidia Quadro K620 GPU, Nvidia Jetson TX1 and TX2, and an Intel Xeon E3-1270V5. Hardware accelerators without a direct interface to TensorFlow, such as the Intel Movidius Neural Compute Stick, were also benchmarked. All tests used unoptimized networks and systems.

The following convolutional models were used in the test:

  • CNN3: This is a fully convolutional neural network. Instead of maxpool layers, a stride size of two is used to decrease the size (the stride size defines the pixel shift of the convolution filter).
  • CNN3Maxpool: The effect of maxpool on performance is shown by comparing the previously described network to a version that uses maxpool instead of stride sizes higher than one.

  • CNN2FC1 and CNN2MaxpoolFC1: A third and fourth comparison can be made when replacing the last layer of the two previous networks with a fully connected layer. This allows more flexibility for the network for the input size of the image.
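The two downsampling strategies can be compared with the standard output-size formula for convolution and pooling windows. The sketch below is only an illustration; the input size of 32 pixels, the 3x3 kernel, and the padding of 1 are assumptions, not values taken from the benchmark.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


input_size = 32  # assumed input width/height in pixels

# CNN3 style: 3x3 convolution with stride 2 downsamples directly.
strided = conv_output_size(input_size, kernel=3, stride=2, padding=1)

# CNN3Maxpool style: 3x3 convolution with stride 1, then 2x2 max pooling.
conv = conv_output_size(input_size, kernel=3, stride=1, padding=1)
pooled = conv_output_size(conv, kernel=2, stride=2)

print(strided, pooled)  # 16 16
```

Both variants halve the feature map, but the maxpool variant evaluates the convolution at every input position first (32x32 instead of 16x16 here), which is where a runtime difference between the two networks can come from.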

For more information, please visit the benchmark’s website.

Machine learning on Cortex-M4 using Keras and ARM-CMSIS-NN

We have developed a simple piece of software to show how a custom Keras model can be automatically translated into C code. The generated C code can, in combination with the ARM-CMSIS-NN functions, be used to run neural network calculations efficiently on an embedded microcontroller such as the Cortex-M4.
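One small part of such a translation is turning floating-point weights into a fixed-point C array. The following is a minimal sketch of that step only, not the generator from our repository: the helper names `to_q7` and `to_c_array`, the array name, and the fixed choice of 7 fractional bits are assumptions for illustration (`q7_t` is the 8-bit fixed-point type used by CMSIS).

```python
def to_q7(value, frac_bits=7):
    """Convert a float to a saturated signed 8-bit fixed-point integer."""
    scaled = round(value * (1 << frac_bits))
    return max(-128, min(127, scaled))


def to_c_array(name, values, frac_bits=7):
    """Render a list of floats as a C definition of a const q7_t array."""
    body = ", ".join(str(to_q7(v, frac_bits)) for v in values)
    return f"const q7_t {name}[{len(values)}] = {{{body}}};"


print(to_c_array("dense_weights", [0.5, -1.0, 0.25]))
# const q7_t dense_weights[3] = {64, -128, 32};
```

In practice the number of fractional bits would be chosen per layer from the observed weight range, so that large weights do not saturate.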

The example software on GitHub also includes firmware which runs on the STM32F4-Discovery board. Part of the firmware was generated with CubeMX.

The example software has an MNIST classifier which can classify handwritten digits.

See the GitHub repository for more details.

HDMI2CSI now running on 28.2.1

The HDMI2CSI board for capturing 4K HDMI was ported to the latest release of L4T (28.2.1), including the bug fix in the Nvidia VI (see forum: [1], [2]). Major differences between L4T 28.1 and L4T 28.2.1 were adapted, and our hardware is again able to capture 2160p30. If you run the tc358840 outside its specification (>1 Gbps), you could also capture 2160p60.

Grab the newest version on Github.

If you run into problems capturing after a format change, restart your application. Currently the Nvidia VI does not recover from an error, namely the PXL_SOF syncpt timeout.

Boost your GStreamer pipeline with the GPU plugin

Embedded devices like the Nvidia Tegra X1/2 offer tremendous video processing capabilities. But often there are bottlenecks hindering you from taking advantage of their full potential. One solution to this problem is to employ the general purpose compute capabilities of the GPU (GPGPU). For this purpose, we have developed a GStreamer Plug-In that lets you add a customized video processing functionality to any pipeline with full GPU support.

A possible application is shown in the image below. Two video inputs are combined into a single video output as a picture-in-picture video stream. A 4k image is shown in the background, and a downscaled FullHD input is streamed on top of it.
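What the picture-in-picture step computes per pixel can be written down as a small CPU reference in plain Python. This is only a sketch of the operation, not the plug-in's CUDA code: the function names, the nearest-neighbour downscaling, and the representation of frames as nested lists are assumptions for illustration.

```python
def downscale(frame, factor):
    """Nearest-neighbour downscale: keep every `factor`-th pixel and row."""
    return [row[::factor] for row in frame[::factor]]


def picture_in_picture(background, inset, top, left):
    """Return a copy of `background` with `inset` pasted at (top, left)."""
    if top + len(inset) > len(background) or left + len(inset[0]) > len(background[0]):
        raise ValueError("inset does not fit into the background")
    out = [row[:] for row in background]  # leave the input frame untouched
    for y, row in enumerate(inset):
        for x, pixel in enumerate(row):
            out[top + y][left + x] = pixel
    return out


background = [[0] * 8 for _ in range(4)]   # stand-in for the 4k frame
small = downscale([[1] * 8 for _ in range(4)], 2)  # stand-in for the FullHD frame
for row in picture_in_picture(background, small, 1, 2):
    print(row)
```

Every output pixel depends on only one input pixel, which is why this kind of operation maps well onto one GPU thread per pixel.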

In order to cope with the huge amount of data, the video processing is outsourced to the GPU. The use of CUDA allows you to create new algorithms from scratch or integrate existing libraries. The plugin enables you to benefit from the unique architecture of the TX1/2, where CPU and GPU share access to the same memory. Therefore, memory access time is reduced and unnecessary copies are avoided. The next image shows a pipeline of the example mentioned above.

At the beginning of the pipeline, where the data rates are the highest, the GPU and internal hardware encoders are used. The CPU can then handle the compressed data easily and gives access to the huge number of existing GStreamer Plug-Ins. For example, it is capable of preparing a live video stream for clients.

The GStreamer Plug-In can also serve as a basis for other applications like format conversion, debayering or video filters.

Feel free to contact us on this topic.

Open Source drivers for HDMI2CSI module updated to support TX1 and TX2

The HDMI2CSI board for capturing 4K HDMI now supports both TX1 and TX2. Video capturing is fully supported for resolutions up to 2160p30 on Input A and 1080p60 on Input B.

Driver development will continue on L4T 28.1. The previous 24.2.1 branch is considered deprecated.

Get started with the Readme and find detailed instructions (for building the kernel etc.) on the Wiki.

Main changes:

  • Driver for tc358840: Now using the updated version that is already in the 28.1 kernel (with a small modification)
  • Device tree: Adapted to be compatible with 28.1 (if you come from previous L4T, please note the new way of flashing a device tree in U-Boot! Also the structure is different with separate repositories for kernel and device tree)
  • VI driver: Using the new version from Nvidia instead of our implementation, since it now supports “ganged mode” for combining multiple VI ports
  • Custom resolutions: The EDID can be read and written from the Linux userspace (See [1]) to support different resolutions/timings on the fly

If you want to use Userptr/Dmabuf mode in GStreamer v4l2src, you still need to rebuild GStreamer. The reason is that GStreamer by default uses libv4l for the REQBUF ioctl. The libv4l implementation of this ioctl does NOT support userptr/dmabuf. But you can just build GStreamer without libv4l and it will use correct implementations for the ioctls and work.
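Once GStreamer has been rebuilt without libv4l, the buffer mode is selected through the `io-mode` property of `v4l2src`. As a rough sketch, the following Python helper assembles a matching `gst-launch-1.0` command line; the helper name, the device path `/dev/video0`, and the `fakesink` at the end are placeholders, not part of our setup.

```python
def build_capture_command(device="/dev/video0", io_mode="userptr"):
    """Assemble a gst-launch-1.0 command that captures with a given I/O mode."""
    allowed = {"auto", "rw", "mmap", "userptr", "dmabuf", "dmabuf-import"}
    if io_mode not in allowed:
        raise ValueError(f"unsupported io-mode: {io_mode}")
    return [
        "gst-launch-1.0",
        "v4l2src", f"device={device}", f"io-mode={io_mode}",
        "!", "fakesink",  # placeholder sink; replace with the real pipeline
    ]


command = build_capture_command()
print(" ".join(command))
```

On the target, the returned list could be passed to `subprocess.run(command)` to start the capture.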

Original release:


MIPI CSI/DSI Interface for General Purpose Data Acquisition

Modern SoC devices offer high performance for data analysis and processing. However, the choice of high-speed general-purpose interfaces that can transfer correspondingly high data rates is limited. The first that comes to mind is PCIe, which is available in most high-performance SoCs. However, PCIe requires a relatively complex controller on both data source and sink. Additionally, the fact that PCIe is such a commonly used interface means that all of the SoC’s PCIe controllers may already be occupied by peripherals.

Coming from the mobile market, some SoCs additionally offer MIPI Camera Serial Interface (CSI) / Display Serial Interface (DSI) [1] interfaces, for example the Nvidia Tegra K1 / X1 or Qualcomm Snapdragon 820. These interfaces were designed for high-bandwidth video input (CSI) and output (DSI). These state-of-the-art SoCs provide CSI-2 D-PHY interfaces which can have a transmission rate of 1.5 to 2.5 Gbps/lane. One such interface consists of a maximum of 4 data lanes and one clock lane. Typically, one to three interfaces are available, making it possible to connect up to six different devices (depending on the SoC model).

Figure 1: MIPI CSI-2 D-PHY interface

Instead of restricting the use of the CSI/DSI interfaces to video only, we propose to use them for transferring general-purpose data. The theoretical maximum bandwidth of such an implementation is 30 Gbps (using three 4-lane MIPI CSI/DSI interfaces at 2.5 Gbps/lane). For a data acquisition application, a sampling rate of 1.875 GSps can be handled. A comparable PCIe x4 v2 interface provides a maximum throughput of 16 Gbps, resulting in a 1 GSps sampling rate. We successfully implemented and tested digital audio data transmission over CSI/DSI and will continue to explore this interesting interface.
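The figures above can be reproduced with a short calculation. The sample width of 16 bits is an assumption inferred from the quoted numbers (30 Gbps / 1.875 GSps), and the PCIe value uses the standard 5 GT/s per lane with 8b/10b encoding of PCIe v2.

```python
LANE_RATE_GBPS = 2.5       # upper D-PHY lane rate quoted above
LANES_PER_INTERFACE = 4
INTERFACES = 3
BITS_PER_SAMPLE = 16       # assumption implied by 30 Gbps / 1.875 GSps

csi_gbps = LANE_RATE_GBPS * LANES_PER_INTERFACE * INTERFACES
csi_gsps = csi_gbps / BITS_PER_SAMPLE

# PCIe v2: 5 GT/s per lane with 8b/10b encoding -> 4 Gbps payload per lane.
pcie_gbps = 5.0 * 8 / 10 * 4
pcie_gsps = pcie_gbps / BITS_PER_SAMPLE

print(csi_gbps, csi_gsps)    # 30.0 1.875
print(pcie_gbps, pcie_gsps)  # 16.0 1.0
```

Both numbers are theoretical link rates; protocol overhead such as packet headers is not included.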

Audio Video Regression Test System

For our test-driven way of development, we built a regression test system for our high-performance video and audio transmission. The system is used to schedule and run tests and monitor the results in real time. For this, it provides a wide range of interfaces to interact with the system under test. This includes interfaces to monitor and manipulate the network traffic as well as interfaces to generate and analyse video and audio signals.



The system is based on a Linux OS and can therefore be used on many different hardware platforms. The tests are written in Python and can be run automatically or manually. An interface to Jenkins makes it possible to combine the test system with the build flow.

The regression test system provides the following advantages:

 – Improved quality due to regression testing

 – Automation of the testing process

 – Simplification of the test implementation

 – Individual adaptions depending on the test dependencies

Improved quality due to regression testing

With regression testing, the system is tested with a large number of test cases. Some of the cases are based on the expected behavior of the system; others are based on reports from customers and partners. Before new software is released, it has to pass all of these test cases. This way, each release is at least as good as the last one, and the software improves continuously with each release.


Automation of the testing process

The InES regression test system provides an interface to Jenkins. This makes it possible to include the tests directly in the build flow. The newest software can be built and downloaded to the target system, which is then tested with all the regression test cases. The Jenkins web interface lets the user see the current progress at any time, and change or interrupt individual steps if required.

Simplification of the test implementation

The InES regression test system provides the required interfaces to the device under test as well as the tools to schedule and execute the tests. The user just has to describe the test cases in Python. The test system can be set up on a PC or embedded system. It is also possible to split the test system over multiple platforms.
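To illustrate what describing a test case in Python can look like, here is a minimal sketch using the standard `unittest` module. It does not show the API of our test system: `FakeDevice` and its methods are hypothetical stand-ins for an interface to the device under test.

```python
import unittest


class FakeDevice:
    """Hypothetical stand-in for the interface to the device under test."""

    def __init__(self):
        self.resolution = None

    def set_video_input(self, width, height):
        self.resolution = (width, height)

    def read_video_output(self):
        # A real interface would analyse the received signal here.
        return self.resolution


class VideoPassThroughTest(unittest.TestCase):
    def setUp(self):
        self.device = FakeDevice()

    def test_resolution_is_preserved(self):
        self.device.set_video_input(1920, 1080)
        self.assertEqual(self.device.read_video_output(), (1920, 1080))


def run_suite():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(VideoPassThroughTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
print("passed" if result.wasSuccessful() else "failed")
```

Because the result object reports success or failure programmatically, a build server can use it to decide whether a build passes.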

Individual adaptions depending on the test dependencies

The regression test system has a modular design. It’s possible to deactivate unused interfaces to reduce the requirements for the platform. It is also possible to add new interfaces specifically adapted to the device under test. This way, the test system can be adapted closely to the device under test as well as to the platform it runs on.
