about summary refs log tree commit diff
path: root/content
diff options
context:
space:
mode:
Diffstat (limited to 'content')
-rw-r--r--content/notes/stuff-about-pcie.org196
1 files changed, 196 insertions, 0 deletions
diff --git a/content/notes/stuff-about-pcie.org b/content/notes/stuff-about-pcie.org
new file mode 100644
index 0000000..4d1a825
--- /dev/null
+++ b/content/notes/stuff-about-pcie.org
@@ -0,0 +1,196 @@
+#+TITLE: Stuff about PCIe
+#+DATE: <2022-01-03 Mon>
+#+TAGS[]: linux hardware
+#+toc: t
+
+* Speed
+The most common versions are 3 and 4, while 5 is starting to be available with newer Intel processors.
+
+| ver | encoding  | transfer rate | x1         | x2          | x4         | x8         | x16         |
+|-----+-----------+---------------+------------+-------------+------------+------------+-------------|
+|   1 | 8b/10b    | 2.5GT/s       | 250MB/s    | 500MB/s     | 1GB/s      | 2GB/s      | 4GB/s       |
+|   2 | 8b/10b    | 5.0GT/s       | 500MB/s    | 1GB/s       | 2GB/s      | 4GB/s      | 8GB/s       |
+|   3 | 128b/130b | 8.0GT/s       | 984.6 MB/s | 1.969 GB/s  | 3.94 GB/s  | 7.88 GB/s  | 15.75 GB/s  |
+|   4 | 128b/130b | 16.0GT/s      | 1969 MB/s  | 3.938 GB/s  | 7.88 GB/s  | 15.75 GB/s | 31.51 GB/s  |
+|   5 | 128b/130b | 32.0GT/s      | 3938 MB/s  | 7.877 GB/s  | 15.75 GB/s | 31.51 GB/s | 63.02 GB/s  |
+|   6 | 128b/130  | 64.0 GT/s     | 7877 MB/s  | 15.754 GB/s | 31.51 GB/s | 63.02 GB/s | 126.03 GB/s |
+
+This is a [[https://community.mellanox.com/s/article/understanding-pcie-configuration-for-maximum-performance][useful]] link to understand the formula: Maximum PCIe Bandwidth = *SPEED* * *WIDTH* * (1 - ENCODING) - 1Gb/s.
+
+We remove 1Gb/s for protocol overhead and error corrections. The main difference between the generations besides the supported speed is the encoding overhead of the packet. For generations 1 and 2, each packet sent on the PCIe has 20% PCIe headers overhead. This was improved in generation 3, where the overhead was reduced to 1.5% (2/130) - see [[https://en.wikipedia.org/wiki/8b/10b_encoding][8b/10b encoding]] and [[https://en.wikipedia.org/wiki/64b/66b_encoding][128b/130b encoding]].
+
+If we apply the formula, for a PCIe version 3 device we can expect 3.7GB/s of data transfer rate:
+#+begin_src
+8GT/s * 4 lanes * (1 - 2/130) - 1G = 32G * 0.985 - 1G = ~30Gb/s -> 3750MB/s
+#+end_src
+* Topology
+The easiest way to see the PCIe topology is with =lspci=:
+#+begin_src
+$ lspci -tv
+-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
+           +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+           +-01.1-[01]----00.0  OCZ Technology Group, Inc. RD400/400A SSD
+           +-01.3-[02-03]----00.0-[03]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
+           +-01.5-[04]--+-00.0  Intel Corporation I350 Gigabit Network Connection
+           |            +-00.1  Intel Corporation I350 Gigabit Network Connection
+           |            +-00.2  Intel Corporation I350 Gigabit Network Connection
+           |            \-00.3  Intel Corporation I350 Gigabit Network Connection
+           +-02.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+           +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+           +-04.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+           +-07.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+           +-07.1-[05]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
+           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
+           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
+           +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+           +-08.1-[06]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
+           |            +-00.1  Advanced Micro Devices, Inc. [AMD] Zeppelin Cryptographic Coprocessor NTBCCP
+           |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
+           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
+           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
+           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
+           +-18.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
+           +-18.1  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
+           +-18.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
+           +-18.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
+           +-18.4  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
+           +-18.5  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
+           +-18.6  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
+           \-18.7  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
+#+end_src
+* View a single device
+#+begin_src
+$ lspci -s 0000:01:00.0
+01:00.0 Non-Volatile memory controller: OCZ Technology Group, Inc. RD400/400A SSD (rev 01)
+#+end_src
+* Reading =lspci= output
+#+begin_src
+$ sudo lspci -vvv -s 0000:01:00.0
+01:00.0 Non-Volatile memory controller: OCZ Technology Group, Inc. RD400/400A SSD (rev 01) (prog-if 02 [NVM Express])
+	Subsystem: OCZ Technology Group, Inc. RD400/400A SSD
+	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
+	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
+	Latency: 0, Cache Line Size: 64 bytes
+	Interrupt: pin A routed to IRQ 41
+	NUMA node: 0
+	Region 0: Memory at ef800000 (64-bit, non-prefetchable) [size=16K]
+	Capabilities: [40] Power Management version 3
+		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
+		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
+	Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
+		Address: 0000000000000000  Data: 0000
+	Capabilities: [70] Express (v2) Endpoint, MSI 00
+		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
+			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
+		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
+			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
+			MaxPayload 128 bytes, MaxReadReq 512 bytes
+		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
+		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <4us
+			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
+		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
+			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
+		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
+			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
+		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
+			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
+			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
+			 FRS- TPHComp- ExtTPHComp-
+			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
+		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
+			 AtomicOpsCtl: ReqEn-
+		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
+		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
+			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
+			 Compliance De-emphasis: -6dB
+		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
+			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
+			 Retimer- 2Retimers- CrosslinkRes: unsupported
+	Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
+		Vector table: BAR=0 offset=00002000
+		PBA: BAR=0 offset=00003000
+	Capabilities: [100 v2] Advanced Error Reporting
+		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
+		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
+		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
+		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
+		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
+		AERCap:	First Error Pointer: 14, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
+			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
+		HeaderLog: 05000001 0000010f 02000010 0f86d1a0
+	Capabilities: [178 v1] Secondary PCI Express
+		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
+		LaneErrStat: 0
+	Capabilities: [198 v1] Latency Tolerance Reporting
+		Max snoop latency: 0ns
+		Max no snoop latency: 0ns
+	Capabilities: [1a0 v1] L1 PM Substates
+		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
+			  PortCommonModeRestoreTime=255us PortTPowerOnTime=400us
+		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
+			   T_CommonMode=0us LTR1.2_Threshold=0ns
+		L1SubCtl2: T_PwrOn=10us
+	Kernel driver in use: nvme
+	Kernel modules: nvme
+
+#+end_src
+
+A few things to note from this output:
++ *GT/s* is the number of transactions supported (here, 8 billion transactions / second). This is gen3 controller (gen1 is 2.5 and gen2 is 5)xs
++ *LNKCAP* is the capabilities which were communicated, and *LNKSTAT* is the current status. You want them to report the same values. If they don't, you are not using the hardware as it is intended (here I'm assuming the hardware is intended to work as a gen3 controller). In case the device is downgraded, the output will be like this: =LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)=
++ *width* is the number of lanes that can be used by the device (here, we can use 4 lanes)
++ *MaxPayload* is the maximum size of a PCIe packet
+* Debugging
+PCI configuration registers can be used to debug various PCI bus issues.
+
+The various registers define bits that are either set (indicated with a '+') or unset (indicated with a '-'). These bits typically have attributes of 'RW1C' meaning you can read and write them and need to write a '1' to clear them. Because these are status bits, if you wanted to 'count' the occurrences of them you would need to write some software that detected the bits getting set, incremented counters, and cleared them over time.
+
+The 'Device Status Register' (DevSta) shows at a high level if there have been correctable errors detected (CorrErr), non-fatal errors detected (UncorrErr), fata errors detected (FataErr), unsupported requests detected (UnsuppReq), if the device requires auxillary power (AuxPwr), and if there are transactions pending (non posted requests that have not been completed).
+
+#+begin_src
+10000:01:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] (prog-if 02 [NVM Express])
+...
+        Capabilities: [100 v1] Advanced Error Reporting
+                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
+                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
+                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
+                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
+                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
+                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
+#+end_src
+
++ The Uncorrectable Error Status (UESta) reports error status of individual uncorrectable error sources (no bits are set
+  above):
+  + Data Link Protocol Error (DLP)
+  + Surprise Down Error (SDES)
+  + Poisoned TLP (TLP)
+  + Flow Control Protocol Error (FCP)
+  + Completion Timeout (CmpltTO)
+  + Completer Abort (CmpltAbrt)
+  + Unexpected Completion (UnxCmplt)
+  + Receiver Overflow (RxOF)
+  + Malformed TLP (MalfTLP)
+  + ECRC Error (ECRC)
+  + Unsupported Request Error (UnsupReq)
+  + ACS Violation (ACSViol)
++ The Uncorrectable Error Mask (UEMsk) controls reporting of individual errors by the device to the PCIe root complex. A
+  masked error (bit set) is not recorded or reported. Above shows no errors are being masked)
++ The Uncorrectable Severity controls whether an individual error is reported as a Non-fatal (clear) or Fatal error
+  (set).
++ The Correctable Error Status reports error status of individual correctable error sources: (no bits are set above)
+  + Receiver Error (RXErr)
+  + Bad TLP status (BadTLP)
+  + Bad DLLP status (BadDLLP)
+  + Replay Timer Timeout status (Timeout)
+  + REPLAY NUM Rollover status (Rollover)
+  + Advisory Non-Fatal Error (NonFatalIErr)
++ The Correctable Erro Mask (CEMsk) controls reporting of individual errors by the device to the PCIe root complex. A
+  masked error (bit set) is not reported to the RC. Above shows that Advisory Non-Fatal Errors are being masked - this
+  bit is set by default to enable compatibility with software that does not comprehend Role-Based error reporting.
++ The Advanced Error Capabilities and Control Register (AERCap) enables various capabilities (The above indicates the
+  device capable of generating ECRC errors but they are not enabled):
+  + First Error Pointer identifies the bit position of the first error reported in the Uncorrectable Error Status register
+  + ECRC Generation Capable (GenCap) indicates if set that the function is capable of generating ECRC
+  + ECRC Generation Enable (GenEn) indicates if ECRC generation is enabled (set)
+  + ECRC Check Capable (ChkCap) indicates if set that the function is capable of checking ECRC
+  + ECRC Check Enable (ChkEn) indicates if ECRC checking is enabled