Bulletproofing PCIe-based SoCs with Advanced Reliability, Availability, Serviceability (RAS) Mechanisms

1. Introduction

As silicon manufacturing process nodes keep shrinking and transistors get smaller, System-on-Chip (SoC) designs are increasingly subject to failures caused by changing external conditions such as temperature, EMI, power surges, and Hot Plug events.

The transition to PCIe 5.0 and 6.0, with signaling speeds of 32GT/s and 64GT/s respectively, also increases the risk of errors due to tighter timing budgets inside the SoC and electrical issues outside the SoC (e.g. crosstalk, line attenuation, and jitter).

In addition, the ever-growing number of PCIe components and systems designed to different revisions of the Specification increases the risk of interoperability issues.

As a result, chip designers who use PCIe as the main communication interface in their SoCs are looking for ways to bulletproof their designs by implementing advanced Reliability, Availability, and Serviceability (RAS) mechanisms that go above and beyond those included in the PCI Express Base Specification.

We start this article by defining “RAS” in the context of PCIe interfacing and looking at the provisions for RAS mechanisms in the PCIe Specification. We then explore some potential PCIe hazards SoC designers can face and the RAS mechanisms that can be implemented to detect, recover from, or prevent these hazards. We conclude with recommendations for choosing a PCIe silicon IP solution that helps mitigate these risks.

2. RAS and PCIe

In the context of an SoC’s PCI Express interface, the three components of RAS can be defined as:

Reliability: the PCIe interface should never cause the SoC or system to fail. As such, any mechanism that allows the PCIe interface to be more tolerant to changing external or internal conditions is considered a Reliability feature.
Availability: the PCIe interface should remain operational in case of failure of the SoC or system. Any mechanism that allows the PCIe interface to continue to operate in case of component failure is considered an Availability feature.
Serviceability: the PCIe interface should enable quick fixing of PCIe related issues. Any mechanism that allows quick and easy identification of PCIe runtime issues or design bugs is considered a Serviceability feature.

3. RAS features in PCIe Protocol

The PCIe Base Specification defines a set of mechanisms primarily intended for link-level reliability. These include:

LCRC, ACK/NAK, Replay: these features ensure that transmitted packets are received correctly across a link. If not correctly received (LCRC error), a packet is NAKed and Replayed (i.e. resent).
ACK/NAK timeouts: ensure that the link partner is live and can force link retraining if an ACK or NAK is not received within the specified time.
ECRC: end-to-end CRC generation/checking ensures packets are not corrupted in their journey between a requester and a completer (this may occur when crossing other components such as switches).
Timeout counters and LTSSM failsafe mechanisms: ensure that the Link Training and Status State Machine (LTSSM) always reinitializes if a timeout occurs, with link partners returning to known states.
Advanced Error Reporting (AER): this optional PCIe capability (ECN) provides advanced error signaling and logging.
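
The LCRC, ACK/NAK, and Replay interplay described above can be illustrated with a minimal sketch. It is not a model of the actual Data Link Layer: `zlib.crc32` stands in for the LCRC, the ACK/NAK exchange is reduced to a function call, and the names `frame`, `check`, `send_with_replay`, and `flaky` are hypothetical helpers introduced only for this example.

```python
import zlib

def frame(seq: int, payload: bytes) -> bytes:
    """Prefix a 12-bit sequence number and append a CRC-32 (stand-in for the LCRC)."""
    body = (seq & 0xFFF).to_bytes(2, "big") + payload
    return body + zlib.crc32(body).to_bytes(4, "big")

def check(raw: bytes):
    """Return (seq, payload) if the CRC matches, else None (the receiver would NAK)."""
    body, crc = raw[:-4], int.from_bytes(raw[-4:], "big")
    if zlib.crc32(body) != crc:
        return None
    return int.from_bytes(body[:2], "big"), body[2:]

def send_with_replay(seq: int, payload: bytes, channel, max_replays: int = 3) -> int:
    """Transmit a packet; on NAK, resend it from the replay buffer.

    Returns the number of replays that were needed.
    """
    replay_buffer = frame(seq, payload)       # kept until the packet is ACKed
    for attempt in range(max_replays + 1):
        if check(channel(replay_buffer)) is not None:
            return attempt                     # ACK: packet can leave the buffer
    raise RuntimeError("replay limit reached; a real link would retrain")

# Hypothetical channel that flips one bit on the first pass only.
calls = {"n": 0}
def flaky(raw: bytes) -> bytes:
    calls["n"] += 1
    if calls["n"] == 1:
        return bytes([raw[0] ^ 0x01]) + raw[1:]
    return raw

replays = send_with_replay(5, b"hello", flaky)
print(replays)  # 1: the corrupted first copy is NAKed, the replayed copy is accepted
```

The sketch also shows why a Replay counter is a useful health indicator: the transfer succeeds, but only at the cost of an extra transmission.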

Although these mechanisms provide basic protection against PCIe link-level hazards, they may prove insufficient for mission-critical deployments in markets such as automotive, HPC/Cloud computing, and enterprise storage/networking.

Potentially hazardous conditions that may require more advanced RAS mechanisms in order to be detected, reported, and/or corrected include:

The device issues intermittent retries due to link instability or credit starvation: the system may still be functional; however, performance is degraded.
The link partner’s DLLPs (ACK, NAK, UpdateFC, etc.) are received slightly after the timeout due to high system latency. Here again the system may still be functional, but with intermittent retries or transitions to the Recovery state.
The link does not reach the L0 state due to a non-compliant TSx sequence, a Receiver Detect error, etc.: the system does not link up.

4. Best Practices for Enhanced Reliability

To improve the reliability of PCIe communications, PCIe designs have integrated a set of best practice mechanisms that are not part of the PCIe Base Specification. These include:

Parity: adding one or more parity bits along the PCIe data path. 1 bit of parity per byte of data is considered good coverage; other options such as 1 bit of parity per 16 bits or 32 bits of data offer lesser coverage but have a lower cost. A single parity bit, however, only detects an odd number of flipped bits in the data it covers, so it offers no protection against double-bit errors and cannot correct any error.
ECC (Error Correcting Code): implementing ECC logic for PCIe data at rest, in particular for receive and transmit buffers. ECC is typically designed for 2-bit error detection and 1-bit correction, which provides a good tradeoff between coverage and cost, with 8 bits of ECC code per 64 bits of data being commonly used.
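
The difference between the two schemes can be seen in a small sketch. The first part computes one even-parity bit per byte. The second part is an extended Hamming code with single-error correction and double-error detection, scaled down to 4 data bits and 4 check bits purely for readability; a production design would more likely use the 64-bit data / 8-bit ECC arrangement mentioned above. All function names are hypothetical.

```python
def byte_parity(data: bytes) -> list[int]:
    """Even parity: one bit per byte, set when the byte has an odd number of 1s."""
    return [bin(b).count("1") & 1 for b in data]

def parity_errors(data: bytes, parity: list[int]) -> list[int]:
    """Indices of bytes whose recomputed parity differs from the stored parity."""
    return [i for i, p in enumerate(byte_parity(data)) if p != parity[i]]

def secded_encode(nibble: int) -> list[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1, 2, 4 hold Hamming parity bits.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) & 1
    return c

def secded_decode(codeword: list[int]):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos            # XOR of set-bit positions locates a single error
    overall = sum(c) & 1
    if overall == 1:
        c[syndrome] ^= 1               # syndrome 0 means the overall parity bit flipped
        status = "corrected"
    elif syndrome != 0:
        return "uncorrectable", None   # even number of flips with a non-zero syndrome
    else:
        status = "ok"
    return status, c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)

word = bytes([0x3C, 0xA5, 0x00, 0xFF])
stored = byte_parity(word)
one_flip = bytes([0x3C ^ 0x01, 0xA5, 0x00, 0xFF])    # single-bit error in byte 0
two_flips = bytes([0x3C, 0xA5 ^ 0x03, 0x00, 0xFF])   # double-bit error in byte 1
print(parity_errors(one_flip, stored))    # [0]
print(parity_errors(two_flips, stored))   # []: the double-bit error goes unnoticed

cw = secded_encode(0b1011)
single = list(cw); single[5] ^= 1
double = list(single); double[2] ^= 1
print(secded_decode(single))   # ('corrected', 11)
print(secded_decode(double))   # ('uncorrectable', None)
```

In terms of storage overhead, 1 parity bit per byte and 8 ECC bits per 64 data bits both cost 12.5%, while 1 parity bit per 16 or 32 bits costs 6.25% or 3.125% respectively.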

The combination of Parity and ECC does a good job of protecting PCIe payload in flight and at rest. However, mission-critical applications demand even more advanced RAS mechanisms.

5. Proposed Advanced RAS Mechanisms

In this section we describe PCIe related issues that have been observed in production environments, which can be quickly detected, reported, and/or corrected using advanced RAS mechanisms. We propose a specific solution for each problem and suggest ways in which this solution can be generalized using RAS mechanisms inside and/or around the PCIe interface logic.

5.1. Non-Compliance

Non-compliance of either link partner can lead to the issues discussed in the following examples.

5.1.1. Equalization timeout due to PHY specific FOM timing

In this scenario, the link should initialize at 32GT/s (Gen5) speed as both link partners support this line rate but instead initializes at 2.5GT/s (Gen1) speed, without triggering any errors.

This happens because the time the SoC’s PHY Rx circuitry needs to compute a Figure of Merit (FOM) for a given preset in Equalization Phase 2 exceeds the EQ Phase 2 preset timer of the SoC’s PCIe controller. The timeout prevents the next preset from being tested (to potentially obtain a better Bit Error Rate, or BER) and in some cases forces the link to fall back to Gen1 speed.

The problem can be detected by monitoring the appropriate timeout condition and can be resolved by increasing the EQ Phase 2 timeout counter (up to 50% as allowed by the PCIe Specification) to allow for multiple presets to be tested and achieve optimal BER. The EQ timeout counters can be further increased beyond PCIe Specification recommendations for even greater margins.

The solution can be generalized with a Reliability mechanism that includes:

exposing every LTSSM ‘timeout expired’ condition to the SoC’s application logic for detecting the issue (i.e. observability),
extending the range of every LTSSM timeout counter by 50% minimum and allowing these counters to be dynamically programmed by the SoC’s application logic.
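
The two parts of this mechanism, observability of expiries and a programmable extension, can be sketched as a small software model. The class `LtssmTimer`, its fields, and the microsecond values are hypothetical and chosen only for illustration; they do not reflect Specification timer values or any particular controller's register map.

```python
from dataclasses import dataclass

@dataclass
class LtssmTimer:
    """One LTSSM timeout counter with a programmable extension over its nominal value."""
    name: str
    nominal_us: float
    extension_pct: float = 0.0        # programmed by the SoC's application logic
    max_extension_pct: float = 50.0   # assumed hardware range for this sketch
    expired_count: int = 0            # 'timeout expired' events, exposed for monitoring

    @property
    def limit_us(self) -> float:
        return self.nominal_us * (1 + self.extension_pct / 100)

    def program(self, extension_pct: float) -> None:
        """Set the extension, rejecting values outside the supported range."""
        if not 0 <= extension_pct <= self.max_extension_pct:
            raise ValueError(f"extension must be within 0..{self.max_extension_pct}%")
        self.extension_pct = extension_pct

    def check(self, elapsed_us: float) -> bool:
        """Return True if the step finished in time; otherwise count an expiry."""
        if elapsed_us > self.limit_us:
            self.expired_count += 1
            return False
        return True

# Illustrative numbers only: the PHY needs 1300 us per preset, the timer allows 1000 us.
timer = LtssmTimer("eq_phase2_preset", nominal_us=1000.0)
fom_time_us = 1300.0

first_try = timer.check(fom_time_us)     # expiry: preset evaluation is cut short
timer.program(50.0)                      # application logic extends the timer
second_try = timer.check(fom_time_us)    # 1300 us now fits within the 1500 us limit

print(first_try, second_try, timer.expired_count)  # False True 1
```

A non-zero `expired_count` is the observable symptom; without it, the fallback to Gen1 speed would occur silently.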

5.1.2. Excessive replays due to link partner’s ACK latency

In this scenario, the SoC is able to transmit write packets and read requests; however, the observed throughput is lower than expected for the link speed and number of active lanes.

This is because the link partner’s ACK latency exceeds the recommended maximum latency defined in the PCIe Specification, resulting in transmission replays that degrade performance.

The problem can be detected by monitoring the number of Replays initiated, and resolved by increasing the ACK timeout counter to accommodate the extra latency.

The solution can be generalized with the Reliability mechanism proposed in the previous section. It should be noted, however, that the size of the Replay Buffer may limit the amount by which the ACK timeout can be increased, in order to avoid buffer overflow.
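
A rough way to reason about this limit is to ask how long the transmitter can keep sending at line rate before unacknowledged data fills the Replay Buffer. The sketch below is back-of-the-envelope arithmetic under stated assumptions: a 64 KiB Replay Buffer (a hypothetical size), 128b/130b encoding (as used at 8, 16, and 32GT/s), and no allowance for TLP or DLLP framing overhead. The function names are illustrative.

```python
def link_bytes_per_us(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Raw link bandwidth in bytes per microsecond, ignoring packet overhead."""
    return gt_per_s * 1e9 * lanes * encoding / 8 / 1e6

def max_covered_latency_us(replay_buffer_bytes: int, gt_per_s: float, lanes: int) -> float:
    """Longest ACK latency the buffer can absorb while transmitting at full rate."""
    return replay_buffer_bytes / link_bytes_per_us(gt_per_s, lanes)

rate = link_bytes_per_us(32, 16)                      # Gen5 x16
budget = max_covered_latency_us(64 * 1024, 32, 16)    # assumed 64 KiB buffer

print(f"{rate:.0f} bytes/us, buffer covers {budget:.2f} us of ACK latency")
```

Under these assumptions, raising the ACK timeout beyond what the buffer can cover no longer prevents stalls: the transmitter simply runs out of buffer space before the timer expires.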

5.2. Tolerance to Errors and Error Injection

In this scenario, a deadlock is observed after a few days of normal operation with packets no longer being transmitted on the link due to insufficient Tx credit available to the SoC’s application logic.

This is due to malformed TLPs sent by the PCIe controller, possibly as a result of an uncorrectable ECC or parity error at the Transmit Buffer level. The malformed TLP is discarded by the link partner’s receiver and associated credits are lost. Deadlock occurs as a consequence of this credit leakage when Tx credits are no longer available, which can occur over the course of several days depending on the frequency of the errors.

The credit leakage can be identified with an indication from the PCIe controller of the number of malformed TLPs transmitted and can be corrected by ensuring that the PCIe controller does not update the associated credits.
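
The leak and its fix can be reproduced with a toy model of Tx credit accounting. It makes strong simplifications that are assumptions of this sketch, not statements about real hardware: every TLP costs one credit, the receiver returns the credit immediately for a good TLP, and one TLP in every `malformed_every` is malformed. The function `run_link` and its parameters are hypothetical.

```python
def run_link(total_tlps: int, credits: int, malformed_every: int,
             account_malformed: bool) -> int:
    """Return how many TLPs were sent before Tx credits ran out.

    account_malformed=True models the bug: credits are consumed for malformed
    TLPs that the receiver then discards without ever returning the credit.
    """
    available = credits
    for n in range(1, total_tlps + 1):
        if available == 0:
            return n - 1                 # deadlock: no credit left to transmit
        malformed = (n % malformed_every == 0)
        if malformed and not account_malformed:
            continue                     # fix: credit counters are left untouched
        available -= 1                   # credit consumed at transmit time
        if not malformed:
            available += 1               # receiver accepts the TLP and returns the credit
    return total_tlps

leaky = run_link(100_000, credits=8, malformed_every=1000, account_malformed=True)
fixed = run_link(100_000, credits=8, malformed_every=1000, account_malformed=False)
print(leaky, fixed)  # 8000 100000
```

With a lower error rate the same leak takes proportionally longer to exhaust the credits, which matches the observation that the deadlock only appears after days of normal operation.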

A generalization of the solution involves the implementation of a Reliability mechanism in which the PCIe interface logic is able to transmit errored TLPs on the PCIe link without incurring side effects such as credit leakage. This enables testing of the system hardware and software response to errors. Similarly, for the receive path, the ability of the PCIe interface to generate errored packets on the user interface without side effects enables testing of the application logic and SoC’s response to errors.

The mechanism typically requires a dedicated set of registers and an interface to the PCIe interface logic that enable:

Controlling the number, type, and frequency of errors
Defining hardware triggers for the error injection
Logging and reporting errors
Allowing register access by SoC firmware and/or host software

The number of injectable errors is ultimately a tradeoff between gate count, implementation complexity, and likelihood of occurrence. Common errors that may be supported include LCRC Errors, Sequence Number Errors, Nullified TLPs, Malformed Packet Errors, Blocked DLLPs (e.g. ACK/NAK), Forced DLLPs (e.g. NAKs), Symbol/Framing errors, and Flow Control errors (such as nonsensical values or blocked updates).
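
As a rough software analogue of such a control interface, the sketch below models an injection rule (error type, total count, and frequency) together with a log of what was injected. The class names, fields, and the subset of error types are hypothetical; a real implementation would expose equivalent controls through registers and hardware triggers.

```python
from dataclasses import dataclass, field
from enum import Enum

class ErrorType(Enum):
    LCRC = "lcrc"
    SEQUENCE_NUMBER = "sequence_number"
    NULLIFIED_TLP = "nullified_tlp"
    BLOCKED_ACK = "blocked_ack"

@dataclass
class InjectionRule:
    error: ErrorType
    count: int            # total number of errors to inject
    interval: int         # inject on every Nth packet seen
    injected: int = 0
    seen: int = 0

    def should_inject(self) -> bool:
        self.seen += 1
        if self.injected < self.count and self.seen % self.interval == 0:
            self.injected += 1
            return True
        return False

@dataclass
class ErrorInjector:
    rules: list = field(default_factory=list)
    log: list = field(default_factory=list)    # (packet number, error name) pairs

    def arm(self, rule: InjectionRule) -> None:
        self.rules.append(rule)

    def on_packet(self, packet_no: int) -> list:
        """Return the errors to apply to this packet and record them in the log."""
        fired = [r.error for r in self.rules if r.should_inject()]
        self.log.extend((packet_no, e.value) for e in fired)
        return fired

injector = ErrorInjector()
injector.arm(InjectionRule(ErrorType.LCRC, count=2, interval=3))
for packet_no in range(1, 11):
    injector.on_packet(packet_no)

print(injector.log)  # [(3, 'lcrc'), (6, 'lcrc')]
```

Bounding each rule by both count and interval keeps a test run deterministic, which makes it easier to correlate injected errors with the system's logged response.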

5.3. Layer-based Monitoring and Troubleshooting

In this example, there are two scenarios:

In one instance, the PCIe link does not link up and LTSSM does not progress past the Detect state possibly due to a problem during the Receiver Detect sequence.
In another instance, the link is unstable with LTSSM frequently transitioning to the Recovery state.

These issues could have multiple causes, originating from either the SoC’s PCIe interface or the link partner’s PCIe interface, including a problem at PHY/MAC level during the training (TSx) sequence, or a problem at Transaction level such as the reception of an Unsupported TLP.

These issues can be detected by probing relevant signals and events at the various layers of the PCIe interface. Bringing the PIPE interface out on the application layer interface can help with PHY/MAC layer issues. It should be noted that data received from the PHY/PCS (RxData) should go through descrambling (even for Training Sets at 8GT/s or higher) to be readable.

This mechanism can further be extended to the PCIe interface’s Physical Coding Sublayer (PCS), Data Link Layer (DLL) and Transaction Layer (TL), whereby relevant signals and events are brought out on the application layer’s interface for easy monitoring and troubleshooting. These signals/events may include:

PHY/PCS Layer: Elasticity Buffer SKP Add/Delete, Speed change/Link width change, Entry to Recovery state, Lane state changes
Data Link Layer: Tx/Rx Ack DLLP, Tx/Rx Update FC DLLP, Tx/Rx Nullified TLP, Rx Duplicate TLP
Transaction layer: Tx/Rx packet types, FC credit exhaustion
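
Once such events are visible to application logic or software, even simple bookkeeping can flag a problem. The sketch below counts events per layer and raises a flag when the link enters Recovery more often than a threshold within a sliding time window. The class `LinkMonitor`, the event names, and the thresholds are hypothetical choices for illustration.

```python
from collections import Counter, deque

class LinkMonitor:
    """Counts per-layer events and flags frequent Recovery entries."""

    def __init__(self, window_us: int, max_recoveries: int):
        self.window_us = window_us
        self.max_recoveries = max_recoveries
        self.counts = Counter()        # (layer, event) -> number of occurrences
        self._recoveries = deque()     # timestamps of recent Recovery entries

    def record(self, t_us: int, layer: str, event: str) -> bool:
        """Log one event; return True if the link looks unstable at time t_us."""
        self.counts[(layer, event)] += 1
        if (layer, event) == ("PHY", "recovery_entry"):
            self._recoveries.append(t_us)
        # Drop Recovery entries that have aged out of the window.
        while self._recoveries and t_us - self._recoveries[0] > self.window_us:
            self._recoveries.popleft()
        return len(self._recoveries) > self.max_recoveries

monitor = LinkMonitor(window_us=1000, max_recoveries=2)
trace = [
    (0, "PHY", "recovery_entry"),
    (100, "DLL", "rx_ack_dllp"),
    (400, "PHY", "recovery_entry"),
    (450, "DLL", "rx_duplicate_tlp"),
    (900, "PHY", "recovery_entry"),
    (5000, "TL", "fc_credit_exhausted"),
]
flags = [monitor.record(*event) for event in trace]
print(flags)  # [False, False, False, False, True, False]
```

The per-layer counters help narrow down where to look next: for example, Recovery entries accompanied by duplicate TLPs point toward the link rather than the Transaction Layer.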

6. Conclusion

As PCI Express is being further deployed into mission-critical applications in the Automotive, AI, and Enterprise markets, the need for higher levels of reliability, availability, and serviceability is increasing.

Whether using a homegrown PCIe interface solution or licensing a PCIe IP solution, SoC designers are looking for mechanisms that provide better visibility of PCIe interface behavior and better control over its operation. Implementing programmable timers and timeouts inside the PCIe interface logic, as well as mechanisms for generating errors without side effects, improves the reliability of the system.

Dedicated monitoring, status, and control interfaces for each of the PCIe interface’s functional layers (PHY, PCS, MAC, DLL, TL) allow SoC designers to flag specific events and errors and improve the overall serviceability and availability of the system.

Rambus is working closely with customers to offer a comprehensive set of RAS mechanisms in its range of PCIe Controller IPs, enabling customers to confidently deploy their SoCs and systems in mission-critical AI, HPC, Automotive, and Enterprise applications.

For more information, please visit us online at www.rambus.com.
