Intro

With the growing complexity of digital circuits, a single SoC (System-on-Chip) can have multiple subsystems and power rails [2]. Dynamic voltage and clock scaling is used quite often: the SoC switches between different clock frequencies or clock modes according to the workload. In the mobile market, if the user wants to play games on the device, some specific subsystems must run in high-performance mode, which in general means a higher clock speed and no active low-power techniques (e.g. clock gating or retention power gating [4]). If a light processing task is the main one, however, lower clock speeds are the right choice for the different subsystems (SS).

In all the scenarios mentioned, the SoC usually needs to handle multiple “clock islands” within the same chip. This means that CDC (clock domain crossing) issues must be taken into account when designing the interactions between buses, memories, and digital circuits [3]. Some CDC fundamentals are quite old, and although they look a bit obvious to most engineers, it is still common in universities, and sometimes in industry, to see people trying to solve complex issues with simple synchronizers without caring about the details.

Common SoC nowadays

Given the above, I recently decided to revisit some CDC papers and write about some simple gotchas that I no longer remembered. At the end of this post, I also present my simple CDC library of common components that can be used to deal with day-to-day issues.

Be careful how you connect the synchronizer circuit when crossing clock boundaries…

As mentioned in [1], if a signal crossing a clock boundary is driven directly by combinational logic, there is a considerable chance that it is unstable when sampled, since it may still be settling through the datapath. It is therefore fundamental to register every signal driven by combinational logic with a flip-flop in the sending clock domain before passing it to the new clock domain through the synchronizer. For instance, let’s suppose the following scenario:

wrong_crosssing

In a case like this, we cannot simply connect the output in clock domain A straight to the synchronizer in clock domain B, since the synchronizer may capture short glitches while data A is switching through the combinational datapath. The correct approach to avoid this is to register the signal right before connecting it to the synchronizer circuit:

correct_crosssing

Only 2FF might not help you when going from fast to slow domain…

There is an additional consideration when crossing from a fast to a slow clock domain using the 2FF circuit. If the pulse lasts a single clock cycle in the fast clock domain, it might never be captured by the first flop of the synchronizer. Because of that, the approach recommended in [1] is to keep the input signal (the one in the fast clock domain) asserted for a minimum width of at least 1.5x the period of the receiving clock (slow domain). Such a design is also known as an open-loop solution, since there is no req/ack or valid/ready handshake across the clock boundary.

fast_slow_issue_2ff

A quick workaround is to drive the signal for long enough to ensure that it is sampled by the flops in the slow clock domain.

fast_slow_correct_2ff
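As a rough aid for the 1.5x rule above, the sketch below computes how many fast-clock cycles the signal has to stay asserted. The function name `min_hold_cycles` and the example frequencies are assumptions made for illustration; they are not part of the library.

```python
import math


def min_hold_cycles(fast_hz: float, slow_hz: float, factor: float = 1.5) -> int:
    """Fast-domain cycles a signal must stay asserted so that its width
    covers at least `factor` periods of the slow (receiving) clock."""
    if fast_hz <= 0 or slow_hz <= 0:
        raise ValueError("clock frequencies must be positive")
    # factor * T_slow / T_fast is the same as factor * fast_hz / slow_hz
    return math.ceil(factor * fast_hz / slow_hz)


# Hypothetical example: 200 MHz sender, 100 MHz receiver
print(min_hold_cycles(200_000_000, 100_000_000))
```

The result is only a lower bound for the open-loop case; the closed-loop approach removes the need to know the frequency ratio in advance.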

Another solution that addresses this issue is the closed-loop 2FF synchronizer. In short, this approach feeds the output of the synchronizer clocked by the slow domain back to the fast clock domain, making sure that the data has been transferred and is stable (see the design cdc_2ff_w_ack.sv).

2ff_closed_loop_solution

When multiple bits are needed to cross clock domain boundaries, don’t simply use 2FF…

One of the most common mistakes with the 2FF synchronizer is using it for multi-bit signals. Although all the designs in my small library have a DATA_WIDTH parameter, using it to send buses is not recommended because of unwanted skew issues. Even the most accurate ASIC processes cannot guarantee that all flops in a die will have the same electrical characteristics, which can lead to different bits being sampled in different clock cycles and corrupt the logic downstream. Clifford E. Cummings groups the solutions for multi-bit synchronization into three categories [1]:

  1. Multi-bit signal consolidation. Where possible, consolidate multiple CDC bits into 1-bit CDC signals.
  2. Multi-cycle path formulations. Use a synchronized load signal to safely pass multiple CDC bits.
  3. Pass multiple CDC bits using gray codes.

In short, the first one can be summarized as merging several signals with related meaning into a single one and using the 2FF just for that signal. For instance, if your design has a load and an enable, both can be fused into a single signal called Ld_En, provided it is known that they will always be asserted together; your work is then to pass this one signal through a 2FF circuit.

Multi-bit signal consolidation

In the second category, the concept is to send all the data straight to a flop in the receiving clock domain and synchronize only a single-bit load signal. There are variations with an additional acknowledge signal and some custom synchronization circuits, but the idea remains the same.

Multi-bit signal single load

The asynchronous FIFO

In my opinion, the third category is the most used, and it is the one I’ll detail a bit more here. The other ones are interesting; however, from my understanding, gray code counters are simple to understand, and the asynchronous FIFO that I designed is flexible enough to match different needs. This async FIFO uses gray counters for the read and write pointers, and these are the only flops that cross from one domain to the other. The reason for using gray counters is that only a single bit changes on every increment, which eliminates the skew issue of crossing multi-bit signals between clock domains.
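The single-bit-change property can be checked with a small behavioral model. This is only a sketch in Python, not the SystemVerilog from the library; the helper names and the 4-bit `WIDTH` are assumptions chosen for illustration.

```python
def bin_to_gray(value: int) -> int:
    """Binary to gray code: XOR the value with itself shifted right by one."""
    return value ^ (value >> 1)


def gray_to_bin(gray: int) -> int:
    """Gray code back to binary: XOR together all right shifts of the code."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


WIDTH = 4  # hypothetical pointer width, for illustration only

for i in range(1 << WIDTH):
    nxt = (i + 1) % (1 << WIDTH)  # includes the wrap from the last value to 0
    changed = bin_to_gray(i) ^ bin_to_gray(nxt)
    assert bin(changed).count("1") == 1   # exactly one bit toggles per increment
    assert gray_to_bin(bin_to_gray(i)) == i  # the conversion round-trips
```

Because only one bit toggles per increment, a synchronizer that samples during the transition sees either the old or the new pointer value, never an unrelated one.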

The design has a single array that stores the data to be transferred from one clock domain to the other, plus two pointers (rd/wr). On one side is the write clock domain, which is responsible for checking that the FIFO is not full before writing into it; on the other side is the read clock domain, which ensures that the FIFO is not empty before allowing a read through the interface. In the write clock domain, wr_data_i/wr_en_i should be connected to the sending circuit, and in the read clock domain, rd_en_i/rd_data_o should be connected to the receiving circuit.

AFIFO example

To compare the pointers for the full flag on the write side, we first convert the read pointer to gray encoding and then bring it into the write clock domain through a 2FF circuit. The same happens for the empty flag: the write pointer is gray-encoded and synchronized into the read clock domain through the 2FF.
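One common way to do these comparisons directly on the gray-coded pointers is sketched below. It assumes pointers that are one bit wider than the address, so the extra MSB tracks wrap-around; this is an assumption about the comparison scheme and may differ from what cdc_async_fifo.sv actually does. The 2FF delay is not modeled, and the names are hypothetical.

```python
def bin_to_gray(value: int) -> int:
    return value ^ (value >> 1)


def is_empty(rd_gray: int, wr_gray_synced: int) -> bool:
    """Read side: empty when both gray pointers are identical."""
    return rd_gray == wr_gray_synced


def is_full(wr_gray: int, rd_gray_synced: int, addr_bits: int) -> bool:
    """Write side: full when the pointers are exactly SLOTS apart. In gray
    code that means the two MSBs are inverted and the other bits match."""
    msb_mask = 0b11 << (addr_bits - 1)
    return wr_gray == (rd_gray_synced ^ msb_mask)


ADDR_BITS = 3                          # assumed: SLOTS = 8
SLOTS = 1 << ADDR_BITS
PTR_MASK = (1 << (ADDR_BITS + 1)) - 1  # pointers carry one extra wrap bit

wr_ptr = rd_ptr = 0
assert is_empty(bin_to_gray(rd_ptr), bin_to_gray(wr_ptr))

for _ in range(SLOTS):                 # write until every slot is used
    wr_ptr = (wr_ptr + 1) & PTR_MASK
assert is_full(bin_to_gray(wr_ptr), bin_to_gray(rd_ptr), ADDR_BITS)
assert not is_empty(bin_to_gray(rd_ptr), bin_to_gray(wr_ptr))
```

In hardware the synchronized pointer lags the real one by a couple of cycles, so the flags err on the safe side: full and empty may stay asserted slightly longer than strictly necessary.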

Two top-level parameters configure the design for different applications: SLOTS sets the depth (number of slots) and WIDTH sets the size of each slot in bits. It is important to highlight that SLOTS has to be a power of 2, starting from 2 (e.g. 2, 4, 8, 16…). To size this parameter, if the clock frequencies, the burst size of the operations, and the number of idle clock cycles are known, you can use the following formula.

$$ FIFO\ depth\ =\ BS\ -BS\ *\lgroup\frac{Read\ Clock\ Freq.}{Write\ Clock\ Freq.*Idle\ cycles}\rgroup $$

For instance, let’s consider the following scenario:

  • Burst size max = 256 (AXI4)
  • Write freq. = 200MHz
  • Read freq. = 100MHz
  • Idle cycles = 1 (no idle cycles)

$$ FIFO\ depth\ =\ 256-256*\lgroup\frac{100}{200*1}\rgroup=\ 128 $$
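The same calculation as a small Python sketch, with the result rounded up to a power of two because SLOTS requires it. The function names `fifo_depth` and `slots_for` are hypothetical helpers, not part of the library.

```python
import math


def fifo_depth(burst_size: int, wr_freq: float, rd_freq: float,
               idle_cycles: int = 1) -> float:
    """Depth formula from above: BS - BS * (f_rd / (f_wr * idle_cycles))."""
    return burst_size - burst_size * (rd_freq / (wr_freq * idle_cycles))


def slots_for(depth: float) -> int:
    """Round the depth up to the next power of two, with a minimum of 2."""
    needed = max(2, math.ceil(depth))
    return 1 << (needed - 1).bit_length()


depth = fifo_depth(burst_size=256, wr_freq=200e6, rd_freq=100e6, idle_cycles=1)
print(depth, slots_for(depth))
```

With the numbers from the scenario above, the depth is already a power of two, so SLOTS can be set to it directly.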

Use case example for bus conversion

In my NoC (Network-on-Chip) project, I had to add a synchronizer between each processing element and the network itself. To work with the two different clock frequencies, I pass the flits (the smallest NoC data unit) through two asynchronous FIFOs, one in the direction from the processing element (CPU/small SoC) to the NoC and the other from the NoC to the processing element, as in the diagram below. The code for this design is available here.

CDC pkt on RaveNoC

If the intention is to convert a whole bus, for instance AXI4, as was done here by ZipCPU, at least 5 AFIFOs will be needed, one for each AXI4 channel!

CDC - library of designs

All the components mentioned above are available here. The designs are in SystemVerilog, and there are some simple tests alongside them to confirm the expected basic behavior.

| Design | Description |
|---|---|
| cdc_2ff_sync.sv | 2FF synchronizer circuit |
| cdc_3ff_sync.sv | 3FF synchronizer circuit |
| cdc_2ff_w_ack.sv | 2FF synchronizer w/ ACK feedback |
| cdc_async_fifo.sv | Asynchronous FIFO for multi-bit CDC |

References

  1. Clifford paper about CDC
  2. Cadence TP about CDC
  3. Practical design for transferring signals between clock domains
  4. Low power design techniques