EDAC - Error Detection And Correction

Written by Doug Thompson <dougthompson@xmission.com>
7 Dec 2005
17 Jul 2007	Updated

EDAC is maintained and written by:

	Doug Thompson, Dave Jiang, Dave Peterson et al,
	original author: Thayne Harbaugh,

Contact:
	website:	bluesmoke.sourceforge.net
	mailing list:	bluesmoke-devel@lists.sourceforge.net

"bluesmoke" was the name for this device driver when it was "out-of-tree"
and maintained at sourceforge.net.  When it was pushed into 2.6.16 for the
first time, it was renamed to 'EDAC'.

The bluesmoke project at sourceforge.net is now utilized as a 'staging area'
for EDAC development, before it is sent upstream to kernel.org.

At the bluesmoke/EDAC project site is a series of quilt patches against
recent kernels, stored in an SVN repository. For easier downloading, there
is also a tarball snapshot available.

============================================================================
EDAC PURPOSE

The goal of the 'edac' kernel module is to detect and report errors that
occur within the computer system running under Linux.

MEMORY

In the initial release, memory Correctable Errors (CE) and Uncorrectable
Errors (UE) are the primary errors being harvested. These types of errors
are harvested by the 'edac_mc' class of device.

Detecting CE events, then harvesting those events and reporting them,
CAN be a predictor of future UE events.  With CE events, the system can
continue to operate, but with less safety.
Preventive maintenance and
proactive part replacement of memory DIMMs exhibiting CEs can reduce
the likelihood of the dreaded UE events and system 'panics'.

NON-MEMORY

A new feature for EDAC, the edac_device class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory types of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA
engines, fabric switches, main data path switches, interconnections,
and various other hardware data paths. If the hardware reports it, then
an edac_device device probably can be constructed to harvest and present
that to userspace.


PCI BUS SCANNING

In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
in order to determine if errors are occurring on data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do NOT follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity.  Some vendors do not do this, and thus the parity bit
can "float" giving false positives.

In the kernel there is a PCI device attribute located in sysfs that is
checked by the EDAC PCI scanning code. If that attribute is set,
PCI parity/error scanning is skipped for that device.
The attribute
is:

	broken_parity_status

and is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for
PCI devices.

FUTURE HARDWARE SCANNING

EDAC will have future error detectors that will be integrated with
EDAC or added to it, in the following list:

	MCE	Machine Check Exception
	MCA	Machine Check Architecture
	NMI	NMI notification of ECC errors
	MSRs	Machine Specific Register error cases
	and other mechanisms.

These errors are usually bus errors, ECC errors, thermal throttling
and the like.


============================================================================
EDAC VERSIONING

EDAC is composed of a "core" module (edac_core.ko) and several Memory
Controller (MC) driver modules. On a given system, the CORE
is loaded and one MC driver will be loaded. Both the CORE and
the MC driver (or edac_device driver) have individual versions that reflect
the current release level of their respective modules.

Thus, to "report" on what version a system is running, one must report both
the CORE's and the MC driver's versions.


LOADING

If 'edac' was statically linked with the kernel then no loading is
necessary.  If 'edac' was built as modules then simply modprobe the
'edac' pieces that you need.
You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary core
modules.

Example:

$> modprobe amd76x_edac

loads both the amd76x_edac.ko memory controller module and the edac_core.ko
core module.


============================================================================
EDAC sysfs INTERFACE

EDAC presents a 'sysfs' interface for control, reporting and attribute
reporting purposes.

EDAC lives in the /sys/devices/system/edac directory.

Within this directory there currently reside 2 'edac' components:

	mc	memory controller(s) system
	pci	PCI control and status system


============================================================================
Memory Controller (mc) Model

First a background on the memory controller's model abstracted in EDAC.
Each 'mc' device controls a set of DIMM memory modules. These modules are
laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
be multiple csrows and multiple channels.

Memory controllers allow for several csrows, with 8 csrows being a typical
value. Yet, the actual number of csrows depends on the electrical "loading"
of a given motherboard, memory controller and DIMM characteristics.

Dual channels allow for 128 bit data transfers to the CPU from memory.
Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs
(FB-DIMMs). The following example will assume 2 channels:


		Channel 0	Channel 1
	===================================
	csrow0	| DIMM_A0	| DIMM_B0 |
	csrow1	| DIMM_A0	| DIMM_B0 |
	===================================

	===================================
	csrow2	| DIMM_A1	| DIMM_B1 |
	csrow3	| DIMM_A1	| DIMM_B1 |
	===================================

In the above example table there are 4 physical slots on the motherboard
for memory DIMMs:

	DIMM_A0
	DIMM_B0
	DIMM_A1
	DIMM_B1

Labels for these slots are usually silk screened on the motherboard. Slots
labeled 'A' are channel 0 in this example. Slots labeled 'B'
are channel 1. Notice that there are two csrows possible on a
physical DIMM.
These csrows are allocated their csrow assignment
based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
is placed in each Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
csrow1 will be populated. The pattern repeats itself for csrow2 and
csrow3.

The representation of the above is reflected in the directory tree
in EDAC's sysfs interface. Starting in directory
/sys/devices/system/edac/mc each memory controller will be represented
by its own 'mcX' directory, where 'X' is the index of the MC.


	..../edac/mc/
		   |
		   |->mc0
		   |->mc1
		   |->mc2
		   ....

Under each 'mcX' directory each csrow is again represented by a
'csrowX' directory, where 'X' is the csrow index:


	.../mc/mc0/
		|
		|->csrow0
		|->csrow2
		|->csrow3
		....

Notice that there is no csrow1, which indicates that csrow0 is
composed of single ranked DIMMs. This should also apply in both
Channels, in order to have dual-channel mode be operational. Since
both csrow2 and csrow3 are populated, this indicates a dual ranked
set of DIMMs for channels 0 and 1.


Within each of the 'mc', 'mcX' and 'csrowX' directories are several
EDAC control and attribute files.


============================================================================
DIRECTORY 'mc'

In directory 'mc' are EDAC system overall control and attribute files:


Panic on UE control file:

	'edac_mc_panic_on_ue'

	An uncorrectable error will cause a machine panic.  This is usually
	desirable.  It is a bad idea to continue when an uncorrectable error
	occurs - it is indeterminate what was uncorrected and the operating
	system context might be so mangled that continuing will lead to further
	corruption. If the kernel has MCE configured, then EDAC will never
	notice the UE.

	LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]

	RUN TIME:  echo "1" >/sys/devices/system/edac/mc/edac_mc_panic_on_ue


Log UE control file:

	'edac_mc_log_ue'

	Generate kernel messages describing uncorrectable errors.  These errors
	are reported through the system message log.  UE statistics
	will be accumulated even when UE logging is disabled.

	LOAD TIME: module/kernel parameter: log_ue=[0|1]

	RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_log_ue


Log CE control file:

	'edac_mc_log_ce'

	Generate kernel messages describing correctable errors.  These
	errors are reported through the system message log.
	CE statistics will be accumulated even when CE logging is disabled.

	LOAD TIME: module/kernel parameter: log_ce=[0|1]

	RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_log_ce


Polling period control file:

	'edac_mc_poll_msec'

	The time period, in milliseconds, for polling for error information.
	Too small a value wastes resources.  Too large a value might delay
	necessary handling of errors and might lose valuable information for
	locating the error.  1000 milliseconds (once each second) is the current
	default. Systems which require all the bandwidth they can get may
	increase this.

	LOAD TIME: module/kernel parameter: poll_msec=<msec>

	RUN TIME: echo "1000" >/sys/devices/system/edac/mc/edac_mc_poll_msec


============================================================================
'mcX' DIRECTORIES


In 'mcX' directories are EDAC control and attribute files for
this 'X' instance of the memory controllers:


Counter reset control file:

	'reset_counters'

	This write-only control file will zero all the statistical counters
	for UE and CE errors.  Zeroing the counters will also reset the timer
	indicating how long since the last counter zero.  This is useful
	for computing errors/time.  Since the counters are always reset at
	driver initialization time, no module/kernel parameter is available.

	RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters

		This resets the counters on memory controller 0


Seconds since last counter reset control file:

	'seconds_since_reset'

	This attribute file displays how many seconds have elapsed since the
	last counter reset. This can be used with the error counters to
	measure error rates.


Memory Controller name attribute file:

	'mc_name'

	This attribute file displays the type of memory controller
	that is being utilized.


Total memory managed by this memory controller attribute file:

	'size_mb'

	This attribute file displays, in count of megabytes, the memory
	that this instance of memory controller manages.


Total Uncorrectable Errors count attribute file:

	'ue_count'

	This attribute file displays the total count of uncorrectable
	errors that have occurred on this memory controller. If panic_on_ue
	is set this counter will not have a chance to increment,
	since EDAC will panic the system.


Total UE count that had no information attribute file:

	'ue_noinfo_count'

	This attribute file displays the number of UEs that
	have occurred with no information as to which DIMM
	slot is having errors.


Total Correctable Errors count attribute file:

	'ce_count'
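

Example (illustrative only, not part of the EDAC sources):

The text above notes that 'seconds_since_reset' can be used with the error
counters to measure error rates. The Python sketch below shows one way that
might be done from userspace. It assumes each of 'seconds_since_reset',
'ce_count' and 'ue_count' holds a single decimal integer; the helper names
read_int() and error_rates() are hypothetical and chosen only for this
example.

```python
from pathlib import Path


def read_int(path: Path) -> int:
    """Read a sysfs attribute file assumed to hold one decimal integer."""
    return int(path.read_text().strip())


def error_rates(mc_dir: Path) -> dict:
    """Return CE and UE errors per hour for one 'mcX' directory.

    Assumption: 'seconds_since_reset', 'ce_count' and 'ue_count' each
    contain a single integer, as their descriptions suggest.
    """
    hours = read_int(mc_dir / "seconds_since_reset") / 3600.0
    rates = {}
    for name in ("ce_count", "ue_count"):
        count = read_int(mc_dir / name)
        # Right after a counter reset no time has elapsed; report 0.0
        # instead of dividing by zero.
        rates[name] = count / hours if hours > 0 else 0.0
    return rates


if __name__ == "__main__":
    # Directory named in the text; it exists only when an EDAC memory
    # controller driver is loaded.
    base = Path("/sys/devices/system/edac/mc")
    if base.is_dir():
        for mc_dir in sorted(base.glob("mc[0-9]*")):
            print(mc_dir.name, error_rates(mc_dir))
```

Writing to 'reset_counters' before a measurement window starts both counts
and elapsed time from zero, so the computed rate covers only that window.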