---
title: Stuff about PCIe
date: 2022-01-03
tags:
- linux
- hardware
---
## Speed
The most common versions are 3 and 4, while version 5 is starting to become
available with newer Intel processors.

| ver | encoding | transfer rate | x1 | x2 | x4 | x8 | x16 |
|-----|-----------|---------------|------------|-------------|------------|------------|-------------|
| 1 | 8b/10b | 2.5GT/s | 250MB/s | 500MB/s | 1GB/s | 2GB/s | 4GB/s |
| 2 | 8b/10b | 5.0GT/s | 500MB/s | 1GB/s | 2GB/s | 4GB/s | 8GB/s |
| 3 | 128b/130b | 8.0GT/s | 984.6 MB/s | 1.969 GB/s | 3.94 GB/s | 7.88 GB/s | 15.75 GB/s |
| 4 | 128b/130b | 16.0GT/s | 1969 MB/s | 3.938 GB/s | 7.88 GB/s | 15.75 GB/s | 31.51 GB/s |
| 5 | 128b/130b | 32.0GT/s | 3938 MB/s | 7.877 GB/s | 15.75 GB/s | 31.51 GB/s | 63.02 GB/s |
| 6   | 128b/130b | 64.0 GT/s     | 7877 MB/s  | 15.754 GB/s | 31.51 GB/s | 63.02 GB/s | 126.03 GB/s |
This
[Mellanox article](https://community.mellanox.com/s/article/understanding-pcie-configuration-for-maximum-performance)
is useful to understand the formula:
Maximum PCIe Bandwidth = SPEED * WIDTH * (1 - ENCODING) - 1Gb/s
We remove 1Gb/s for protocol overhead and error correction. The main
difference between the generations, besides the supported speed, is the
encoding overhead. With generations 1 and 2, each packet sent on the
PCIe link carries a 20% encoding overhead (8b/10b). This was reduced in
generation 3 to roughly 1.5% (2/130) - see
[8b/10b encoding](https://en.wikipedia.org/wiki/8b/10b_encoding) and
[128b/130b encoding](https://en.wikipedia.org/wiki/64b/66b_encoding).
If we apply the formula to a PCIe version 3 x4 device, we can expect
roughly 3.7GB/s of data transfer rate:
8GT/s * 4 lanes * (1 - 2/130) - 1G = 32G * 0.985 - 1G = ~30Gb/s -> 3750MB/s
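To double-check the arithmetic, here is a small shell sketch of the same calculation
(assumes `bc` is installed; the values are the ones from the gen3 x4 example above):
```sh
speed=8     # transfer rate in GT/s per lane (gen3)
width=4     # number of lanes
# 128b/130b encoding overhead is 2/130, and we subtract ~1Gb/s for protocol overhead
echo "scale=4; $speed * $width * (1 - 2/130) - 1" | bc        # ~30.5 Gb/s
echo "scale=4; ($speed * $width * (1 - 2/130) - 1) / 8" | bc  # ~3.8 GB/s
```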
## Topology
An easy way to see the PCIe topology is with `lspci`:
```
$ lspci -tv
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
+-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+-01.1-[01]----00.0 OCZ Technology Group, Inc. RD400/400A SSD
+-01.3-[02-03]----00.0-[03]----00.0 ASPEED Technology, Inc. ASPEED Graphics Family
+-01.5-[04]--+-00.0 Intel Corporation I350 Gigabit Network Connection
| +-00.1 Intel Corporation I350 Gigabit Network Connection
| +-00.2 Intel Corporation I350 Gigabit Network Connection
| \-00.3 Intel Corporation I350 Gigabit Network Connection
+-02.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+-03.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+-04.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+-07.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+-07.1-[05]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
| +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
| \-00.3 Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
+-08.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
+-08.1-[06]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
| +-00.1 Advanced Micro Devices, Inc. [AMD] Zeppelin Cryptographic Coprocessor NTBCCP
| +-00.2 Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
| \-00.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
+-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
+-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
+-18.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
+-18.1 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
+-18.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
+-18.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
+-18.4 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
+-18.5 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
+-18.6 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
\-18.7 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
```
Now, how do we read this? Take this entry, for example:
```
+-[10000:00]-+-02.0-[01]----00.0 Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
| \-03.0-[02]----00.0 Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
```
This is a lot of information; let's break it down:
- The first part in brackets (`[10000:00]`) is the PCI domain and the bus number.
- The second part (`02.0`) is the device and function number of the bridge sitting on that bus.
- The numbers in brackets after a bridge (`[01]`) are the secondary bus behind it, and the
  final `00.0` is the device and function of the endpoint on that bus.
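One way to see this chain in practice is to resolve the device's sysfs entry - the path
walks from the root complex, through the bridge, down to the endpoint (a sketch; the
address `0000:01:00.0` is the SSD from the earlier tree, substitute your own):
```sh
# Each path component is a domain:bus:device.function hop in the topology
readlink -f /sys/bus/pci/devices/0000:01:00.0
# -> /sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0
```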
## View a single device
```sh
$ lspci -v -s 0000:01:00.0
01:00.0 Non-Volatile memory controller: OCZ Technology Group, Inc. RD400/400A SSD (rev 01) (prog-if 02 [NVM Express])
    Subsystem: OCZ Technology Group, Inc. RD400/400A SSD
    Flags: bus master, fast devsel, latency 0, IRQ 41, NUMA node 0
    Memory at ef800000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: <access denied>
    Kernel driver in use: nvme
    Kernel modules: nvme
```
## Reading `lspci` output
```
$ sudo lspci -vvv -s 0000:01:00.0
01:00.0 Non-Volatile memory controller: OCZ Technology Group, Inc. RD400/400A SSD (rev 01) (prog-if 02 [NVM Express])
Subsystem: OCZ Technology Group, Inc. RD400/400A SSD
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 41
NUMA node: 0
Region 0: Memory at ef800000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <4us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 14, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 05000001 0000010f 02000010 0f86d1a0
Capabilities: [178 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [198 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [1a0 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
PortCommonModeRestoreTime=255us PortTPowerOnTime=400us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Kernel driver in use: nvme
Kernel modules: nvme
```
A few things to note from this output:
- **GT/s** is the number of transfers per second supported (here, 8 billion
  transfers / second). This is a gen3 device (gen1 is 2.5GT/s and
  gen2 is 5GT/s)
- **LnkCap** lists the capabilities the device advertised, and
  **LnkSta** is the current negotiated status. You want them to report the same
  values. If they don't, you are not using the hardware as it is
  intended (here I'm assuming the hardware is intended to work as a
  gen3 controller). If the link is downgraded, the output will
  look like this: `LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)`
- **Width** is the number of lanes that can be used by the device
  (here, we can use 4 lanes)
- **MaxPayload** is the maximum payload size of a PCIe packet (TLP)
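If you only need the link speed and width, sysfs exposes both the capability and the
negotiated values directly (a sketch, assuming a reasonably recent kernel; the BDF is
the SSD from above - substitute your own):
```sh
dev=0000:01:00.0
cd /sys/bus/pci/devices/$dev
# capability vs. what was actually negotiated - ideally these match
cat max_link_speed current_link_speed   # e.g. "8.0 GT/s PCIe" twice
cat max_link_width current_link_width   # e.g. "4" twice
```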
## Debugging
PCI configuration registers can be used to debug various PCI bus issues.
The various registers define bits that are either set (indicated with a
'+') or unset (indicated with a '-'). These bits typically have
attributes of 'RW1C' meaning you can read and write them and need to
write a '1' to clear them. Because these are status bits, if you wanted
to 'count' the occurrences of them you would need to write some software
that detected the bits getting set, incremented counters, and cleared
them over time.
The 'Device Status Register' (DevSta) shows at a high level if there
have been correctable errors detected (CorrErr), non-fatal errors
detected (NonFatalErr), fatal errors detected (FatalErr), unsupported
requests detected (UnsupReq), if the device requires auxiliary power
(AuxPwr), and if there are transactions pending (TransPend - non-posted
requests that have not been completed).
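A rough sketch of how you could poll and clear those DevSta bits with `setpci`
(`0x0a` is the offset of the Device Status register inside the PCIe capability;
the BDF is an example - substitute your own device):
```sh
dev=01:00.0
# Read the 16-bit Device Status register:
# bit 0 CorrErr, 1 NonFatalErr, 2 FatalErr, 3 UnsupReq, 4 AuxPwr, 5 TransPend
sudo setpci -s $dev CAP_EXP+0x0a.w
# The error bits are RW1C: write 1s to clear them (AuxPwr and TransPend are read-only)
sudo setpci -s $dev CAP_EXP+0x0a.w=0x000f
```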
For more detail there is the Advanced Error Reporting (AER) capability - here it is for another NVMe device:
```
10000:01:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] (prog-if 02 [NVM Express])
...
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
```
- The Uncorrectable Error Status (UESta) reports the error status of
  individual uncorrectable error sources (no bits are set above):
  - Data Link Protocol Error (DLP)
  - Surprise Down Error (SDES)
  - Poisoned TLP (TLP)
  - Flow Control Protocol Error (FCP)
  - Completion Timeout (CmpltTO)
  - Completer Abort (CmpltAbrt)
  - Unexpected Completion (UnxCmplt)
  - Receiver Overflow (RxOF)
  - Malformed TLP (MalfTLP)
  - ECRC Error (ECRC)
  - Unsupported Request Error (UnsupReq)
  - ACS Violation (ACSViol)
- The Uncorrectable Error Mask (UEMsk) controls reporting of
  individual errors by the device to the PCIe root complex. A masked
  error (bit set) is not recorded or reported (above, no errors are
  being masked).
- The Uncorrectable Error Severity (UESvrt) controls whether an
  individual error is reported as a Non-fatal (bit clear) or Fatal
  (bit set) error.
- The Correctable Error Status (CESta) reports the error status of
  individual correctable error sources (no bits are set above):
  - Receiver Error (RxErr)
  - Bad TLP status (BadTLP)
  - Bad DLLP status (BadDLLP)
  - Replay Timer Timeout status (Timeout)
  - REPLAY NUM Rollover status (Rollover)
  - Advisory Non-Fatal Error (NonFatalErr)
- The Correctable Error Mask (CEMsk) controls reporting of individual
  errors by the device to the PCIe root complex. A masked error (bit
  set) is not reported to the RC. Above, Advisory Non-Fatal Errors
  are being masked - this bit is set by default to keep compatibility
  with software that does not comprehend Role-Based error reporting.
- The Advanced Error Capabilities and Control Register (AERCap)
  enables various capabilities (the output above indicates the device
  is capable of generating and checking ECRC, but neither is enabled):
  - First Error Pointer identifies the bit position of the first
    error reported in the Uncorrectable Error Status register
  - ECRC Generation Capable (GenCap) indicates, if set, that the
    function is capable of generating ECRC
  - ECRC Generation Enable (GenEn) indicates if ECRC generation is
    enabled (set)
  - ECRC Check Capable (ChkCap) indicates, if set, that the function
    is capable of checking ECRC
  - ECRC Check Enable (ChkEn) indicates if ECRC checking is enabled
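Rather than writing your own counting logic, recent kernels already expose per-device
AER counters in sysfs (a sketch; availability depends on the kernel version, and the
BDF is an example - substitute your own):
```sh
dev=0000:01:00.0
cat /sys/bus/pci/devices/$dev/aer_dev_correctable
cat /sys/bus/pci/devices/$dev/aer_dev_nonfatal
cat /sys/bus/pci/devices/$dev/aer_dev_fatal
# AER events also end up in the kernel log
dmesg | grep -i AER
```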
## Compute Express Link (CXL)
[Compute Express Link](https://en.wikipedia.org/wiki/Compute_Express_Link) (CXL) is an open standard for high-speed central processing unit (CPU)-to-device and CPU-to-memory connections, designed for high performance data center computers. The standard is built on top of the PCIe physical interface with protocols for I/O, memory, and cache coherence.