• performance patterns not comparable across reboots

    From Christian Dürrhauer@cdurrhau@zedat.fu-berlin.de to alt.os.linux.ubuntu on Thu Jul 4 19:05:45 2024
    From Newsgroup: alt.os.linux.ubuntu

    Hi,

    I would like to get a fresh view on my topic.

    The awkward behavior on an Ivy Bridge-based machine (otherwise stable, no question) is that a few of the hardware components perform differently across reboots. It does not happen reliably or follow a pattern, at least not one that I was able to find, but it shows up in roughly 5 out of 15 reboots. The machine is a digital video recorder and it certainly does not need to be replaced.

    There is an NVMe 1.3 SSD (Samsung 990 Pro) on a PCIe 4.0 card in a PCIe 3.0 x4 slot.
    There is a Realtek 8125B 2.5GbE network card (PCIe 2.0 x1) in a PCIe 2.0 x1 slot.
    Ubuntu 22.04.4, current (kernel 5.15.101 plus Realtek driver package, r8169 driver blacklisted), booting from a SATA drive.

    When the issue occurs, the SSD delivers 1.9 GB/s and the network card delivers 169 MB/s.
    In normal cases, the SSD delivers 3.5 GB/s and the network card delivers 275 MB/s (so the difference is significant, but the system is still functionally OK).

    Like I said, I fail to see a pattern. The system log files are huge, but I still tried to compare them and am relatively confident I did not find anything striking.

    I have swapped power supply, mainboard, SSD, RAM, CPU, keyboard/mouse. Booting other Ubuntu installations (Clonezilla images) looks similar.

    I tried googling it but found nothing useful; Google thinks it knows what I am looking for, and the results are polluted with hits for the same search terms in a totally different context.

    Does anyone have an idea what is happening here?
    --
    mit freundlichen Grüßen/with kind regards,

    Christian Dürrhauer
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From azigni@azigni@yahoo.com to alt.os.linux.ubuntu on Thu Jul 4 20:35:32 2024

    How old is the hardware that does not get changed out? Electronics like everything else have their issues. It may be you have a failing chip, transistor, cap, or something else.

    Are the connectors and connections solid and clean? Perhaps one of the connections is not seated well, or a pin and plug have worn out and do not connect as well as when they were new.

    If the parts you have changed out do not affect the performance glitch, then one
    of the parts that have not been changed out does.
  • From philo@philo@novabbs.com (philo) to alt.os.linux.ubuntu on Fri Jul 5 00:18:43 2024

    It looks like you've exchanged everything but the OS itself.

    Try a different distribution such as Fedora maybe.
  • From Paul@nospam@needed.invalid to alt.os.linux.ubuntu on Fri Jul 5 06:06:32 2024

    The numbers suggest improper PCI Express negotiation.
    275MB/sec is close to the expected 280MB/sec for a 2.5GbE LAN.
    This means the PCI Express link is running at the expected rate
    in that case, the same case where the SSD gets 3.5GB/sec. If the
    PCIe link on the Realtek is running at half the rate, then the
    network output will be "clipped" accordingly. Peripheral cards,
    for the most part, neatly survive starvation and still function.
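
    As a rough check of that reasoning, the theoretical ceilings per PCIe generation can be worked out from the standard signalling rates and line encodings. This is only a sketch: the candidate "degraded" link states (Gen1 x1 for the network card, Gen2 x4 or Gen3 x2 for the SSD) are assumptions for illustration, not something confirmed in this thread, and real throughput sits below these ceilings because of packet overhead.

```python
# Signalling rate per lane in GT/s for each PCIe generation.
GT_PER_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0}
# Fraction of transferred bits that carry data: Gen 1 and 2 use 8b/10b
# line encoding, Gen 3 and later use 128b/130b.
ENCODING = {1: 8 / 10, 2: 8 / 10, 3: 128 / 130, 4: 128 / 130}


def link_bandwidth_mb_s(gen, lanes):
    """Theoretical one-direction ceiling in MB/s, before packet overhead."""
    bits_per_s = GT_PER_S[gen] * 1e9 * ENCODING[gen] * lanes
    return bits_per_s / 8 / 1e6


def fits(observed_mb_s, gen, lanes):
    """True if the observed throughput is possible on the given link."""
    return observed_mb_s <= link_bandwidth_mb_s(gen, lanes)


if __name__ == "__main__":
    # Throughput figures are the ones reported in the original post;
    # the listed link states are hypothetical candidates.
    cases = [
        ("NIC good (275 MB/s)", 275, [(2, 1), (1, 1)]),
        ("NIC bad  (169 MB/s)", 169, [(2, 1), (1, 1)]),
        ("SSD good (3500 MB/s)", 3500, [(3, 4), (3, 2), (2, 4)]),
        ("SSD bad  (1900 MB/s)", 1900, [(3, 4), (3, 2), (2, 4)]),
    ]
    for label, observed, links in cases:
        for gen, lanes in links:
            ceiling = link_bandwidth_mb_s(gen, lanes)
            verdict = "possible" if fits(observed, gen, lanes) else "impossible"
            print(f"{label} on Gen{gen} x{lanes}: ceiling {ceiling:7.1f} MB/s -> {verdict}")
```

    The "good" figures do not fit through a link at half the rate, while the "bad" figures sit just under the halved ceilings, which is consistent with a link that trained one step lower on those boots. For the SSD, a halved lane count (x2 instead of x4) would give nearly the same ceiling as a one-generation drop, so the arithmetic alone cannot tell the two apart.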

    An NVMe can be connected to the processor directly, or it can use
    the PCH (Southbridge) x4 interface, which usually runs one standards
    revision lower than the CPU one. There can be two sled connectors
    on the motherboard. The one nearest the CPU runs at 2x the speed
    of the one nearest the Southbridge heatsink.

    CPU --- PCIe Rev4 --- NVMe connector
     |
    DMI Rev3                           ^
     |                                  \
    PCH --- PCIe Rev3 --- NVMe connector <=== PCIe can rate-reduce down to version 1.1 by itself,
                                              as part of the startup procedure for it. Some modern
                                              video cards have done this, without telling you.

    The Southbridge (PCH) is usually over-subscribed, which means if all
    the "peripherals" on the Southbridge become busy at the same
    time, the DMI from CPU to PCH does not have the bandwidth for
    that. But I don't think that is happening. And that does not
    affect PHY negotiation in any case. The DMI bus, even if it were
    forced into a pathological test case, continues to run, and most
    of the time the user might not even be aware there is an issue.

    Like a vehicle stuck in too low a gear, for some reason a PCIe hub is running
    one standards version too slow. There is probably a way to "jam" this
    in software. For example, a few video cards had a different videoBIOS
    added, to force their bus interface to a PCIe 1.1 revision rate,
    for stability reasons (8800 era). This made the video card not quite as
    fast as it could have been, but it also ensured the video card always worked,
    which is pretty important. No more black screens.

    I don't know if "dmesg" has log entries for PCIe or not. The
    hardware itself can negotiate for the highest rate. But there
    should be more than one mechanism for interfering with that.
    I'm a bit worried this is a BIOS code issue (SMI/SMM runs multiple
    times a second).
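
    Independently of what "dmesg" logs, Linux exposes the negotiated link state of each PCIe device through sysfs attributes (current_link_speed, current_link_width, max_link_speed, max_link_width). The following is a minimal sketch that collects them so the output of a "slow" boot can be compared against a "fast" boot; the function name and the output format are made up for illustration, and devices without these attributes are simply skipped.

```python
from pathlib import Path

ATTRS = ("current_link_speed", "max_link_speed",
         "current_link_width", "max_link_width")


def read_attr(dev, name):
    """Return the stripped content of a sysfs attribute, or None if unreadable."""
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return None


def link_report(root="/sys/bus/pci/devices"):
    """Map each PCI address to its link attributes (hypothetical helper)."""
    report = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return report
    for dev in sorted(root_path.iterdir()):
        values = {name: read_attr(dev, name) for name in ATTRS}
        # Skip devices that do not expose all four link attributes.
        if None in values.values():
            continue
        report[dev.name] = values
    return report


if __name__ == "__main__":
    for address, v in link_report().items():
        print(f"{address}: speed {v['current_link_speed']} "
              f"(max {v['max_link_speed']}), "
              f"width x{v['current_link_width']} "
              f"(max x{v['max_link_width']})")
```

    Saving this output once per boot and diffing a slow boot against a fast one should show whether the SSD or the Realtek card came up at a lower speed or width. Note that the "max" values describe what the device itself supports, so a PCIe 4.0 SSD in a PCIe 3.0 slot will always report a current speed below its maximum; the comparison that matters is boot against boot. The LnkCap and LnkSta lines of "lspci -vv" (run as root) carry the same information.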

    One thing I have discovered to my horror, is the BIOS is
    pretty autonomous and not above mischief. My processor was
    crashing, but this was no ordinary crash. This was not an MCE
    (Machine Check Error) like on a legacy CPU. Instead, it would
    appear the BIOS "parked" my processor and turned off both the
    keyboard +5V and the mouse +5V (PS/2 and USB). None of the
    USB ports worked. The mains power (measured by a meter which
    is always present), showed 54W versus idle which is 36W. It's
    my belief the BIOS did this in an SMI service routine. But, I
    cannot find any documentation, nor a means of monitoring the
    BIOS while the OS is running.

    Placing the CPU into another motherboard, the CPU runs normally.
    The BIOS will eventually "tune something" to the point of ruin,
    but it might take a long time before one of these "crashes" comes back.
    And it's not really a crash, it is a kind of Safe Mode for Hardware.
    There is no documentation. Other people have noted something is
    wrong with C state control, and switching off C states (CPU runs warmer),
    also apparently eliminates this BIOS issue.

    When a BIOS ("UEFI") screws around, that destroys the "trust" we had
    in the Legacy BIOS era. UEFI can be programmed from the OS. UEFI
    can even agree to flash itself (automatic updates from motherboard
    maker). There is a huge attack surface for mischief on these
    newer motherboards. Such a bad, bad idea. We have learned nothing,
    apparently, over the years, about defensive design.

    Paul
