edac.txt (Linux 2.6.16 version)

EDAC - Error Detection And Correction

Written by Doug Thompson <norsk5@xmission.com>

7 Dec 2005


EDAC was written by:
	Thayne Harbaugh,
	modified by Dave Peterson, Doug Thompson, et al,
	from the bluesmoke.sourceforge.net project.


============================================================================
EDAC PURPOSE

The goal of the 'edac' kernel module is to detect and report errors that occur
within the computer system. In the initial release, memory Correctable Errors
(CE) and Uncorrectable Errors (UE) are the primary errors being harvested.

Detecting CE events, then harvesting those events and reporting them,
CAN be a predictor of future UE events.  With CE events, the system can
continue to operate, but with less safety. Preventive maintenance and
proactive part replacement of memory DIMMs exhibiting CEs can reduce
the likelihood of the dreaded UE events and system 'panics'.


In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
in order to determine if errors are occurring on data transfers.
The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do NOT follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity.  Some vendors do not do this, and thus the parity bit
can "float", giving false positives.

The PCI Parity EDAC device has the ability to "skip" known flaky
cards during the parity scan. These are set by the parity "blacklist"
interface in the sysfs for PCI Parity. (See the PCI section in the sysfs
section below.)
There is also a parity "whitelist" which is used as
an explicit list of devices to scan, while the blacklist is a list
of devices to skip.

Future error detectors will be added to or integrated into EDAC,
including the following:

	MCE	Machine Check Exception
	MCA	Machine Check Architecture
	NMI	NMI notification of ECC errors
	MSRs 	Machine Specific Register error cases
	and other mechanisms.

These errors are usually bus errors, ECC errors, thermal throttling
and the like.


============================================================================
EDAC VERSIONING

EDAC is composed of a "core" module (edac_mc.ko) and several Memory
Controller (MC) driver modules. On a given system, the CORE
is loaded and one MC driver will be loaded. Both the CORE and
the MC driver have individual versions that reflect the current release
level of their respective modules.  Thus, to "report" on what version
a system is running, one must report both the CORE's and the
MC driver's versions.


LOADING

If 'edac' was statically linked with the kernel then no loading is
necessary.  If 'edac' was built as modules then simply modprobe the
'edac' pieces that you need.  You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary core
modules.

Example:

$> modprobe amd76x_edac

loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
core module.


============================================================================
EDAC sysfs INTERFACE

EDAC presents a 'sysfs' interface for control, reporting and attribute
reporting purposes.

EDAC lives in the /sys/devices/system/edac directory. Within this directory
there currently reside 2 'edac' components:

	mc	memory controller(s) system
	pci	PCI status system


============================================================================
Memory Controller (mc) Model

First, some background on the memory controller model abstracted in EDAC.
Each mc device controls a set of DIMM memory modules.
These modules are
laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
be multiple csrows and two channels.

Memory controllers allow for several csrows, with 8 csrows being a typical value.
Yet, the actual number of csrows depends on the electrical "loading"
of a given motherboard, memory controller and DIMM characteristics.

Dual channels allow for 128 bit data transfers to the CPU from memory.


		Channel 0	Channel 1
	===================================
	csrow0	| DIMM_A0	| DIMM_B0 |
	csrow1	| DIMM_A0	| DIMM_B0 |
	===================================

	===================================
	csrow2	| DIMM_A1	| DIMM_B1 |
	csrow3	| DIMM_A1	| DIMM_B1 |
	===================================

In the above example table there are 4 physical slots on the motherboard
for memory DIMMs:

	DIMM_A0
	DIMM_B0
	DIMM_A1
	DIMM_B1

Labels for these slots are usually silk screened on the motherboard. Slots
labeled 'A' are channel 0 in this example. Slots labeled 'B'
are channel 1. Notice that there are two csrows possible on a
physical DIMM. These csrows are allocated their csrow assignment
based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
is placed in each Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
csrow1 will be populated. The pattern repeats itself for csrow2 and
csrow3.

The representation of the above is reflected in the directory tree
in EDAC's sysfs interface. Starting in directory
/sys/devices/system/edac/mc, each memory controller will be represented
by its own 'mcX' directory, where 'X' is the index of the MC.

	..../edac/mc/
		   |
		   |->mc0
		   |->mc1
		   |->mc2
		   ....

Under each 'mcX' directory, each csrow is in turn represented by a
'csrowX' directory, where 'X' is the csrow index:


	.../mc/mc0/
		|
		|->csrow0
		|->csrow2
		|->csrow3
		....

Notice that there is no csrow1, which indicates that csrow0 is
composed of single ranked DIMMs. This should also apply in both
Channels, in order to have dual-channel mode be operational. Since
both csrow2 and csrow3 are populated, this indicates a dual ranked
set of DIMMs for channels 0 and 1.


Within each of the 'mc', 'mcX' and 'csrowX' directories are several
EDAC control and attribute files.


============================================================================
DIRECTORY 'mc'

In directory 'mc' are EDAC system overall control and attribute files:


Panic on UE control file:

	'panic_on_ue'

	An uncorrectable error will cause a machine panic.  This is usually
	desirable.  It is a bad idea to continue when an uncorrectable error
	occurs - it is indeterminate what was uncorrected and the operating
	system context might be so mangled that continuing will lead to further
	corruption. If the kernel has MCE configured, then EDAC will never
	notice the UE.

	LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]

	RUN TIME:  echo "1" >/sys/devices/system/edac/mc/panic_on_ue


Log UE control file:

	'log_ue'

	Generate kernel messages describing uncorrectable errors.  These errors
	are reported through the system message log.  UE statistics
	will be accumulated even when UE logging is disabled.

	LOAD TIME: module/kernel parameter: log_ue=[0|1]

	RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue


Log CE control file:

	'log_ce'

	Generate kernel messages describing correctable errors.  These
	errors are reported through the system message log.
	CE statistics will be accumulated even when CE logging is disabled.

	LOAD TIME: module/kernel parameter: log_ce=[0|1]

	RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce


Polling period control file:

	'poll_msec'

	The time period, in milliseconds, for polling for error information.
	Too small a value wastes resources.  Too large a value might delay
	necessary handling of errors and might lose valuable information for
	locating the error.  1000 milliseconds (once each second) is about
	right for most uses.

	LOAD TIME: module/kernel parameter: poll_msec=<period in milliseconds>

	RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec


Module Version read-only attribute file:

	'mc_version'

	The EDAC CORE module's version and compile date are shown here to
	indicate what EDAC is running.



============================================================================
'mcX' DIRECTORIES


In 'mcX' directories are EDAC control and attribute files for
this 'X' instance of the memory controllers:


Counter reset control file:

	'reset_counters'

	This write-only control file will zero all the statistical counters
	for UE and CE errors.  Zeroing the counters will also reset the timer
	indicating how long since the last counter zero.  This is useful
	for computing errors/time.  Since the counters are always reset at
	driver initialization time, no module/kernel parameter is available.

	RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters

		This resets the counters on memory controller 0


Seconds since last counter reset attribute file:

	'seconds_since_reset'

	This attribute file displays how many seconds have elapsed since the
	last counter reset. This can be used with the error counters to
	measure error rates.


DIMM capability attribute file:

	'edac_capability'

	The EDAC (Error Detection and Correction) capabilities/modes of
	the memory controller hardware.


DIMM Current Capability attribute file:

	'edac_current_capability'

	The EDAC capabilities available with the hardware
	configuration.  This may not be the same as "EDAC capability"
	if the correct memory is not used.
	If a memory controller is
	capable of EDAC, but DIMMs without check bits are in use, then
	Parity, SECDED, S4ECD4ED capabilities will not be available
	even though the memory controller might be capable of those
	modes with the proper memory loaded.


Memory Type supported on this controller attribute file:

	'supported_mem_type'

	This attribute file displays the memory type, usually
	buffered and unbuffered DIMMs.


Memory Controller name attribute file:

	'mc_name'

	This attribute file displays the type of memory controller
	that is being utilized.


Memory Controller Module name attribute file:

	'module_name'

	This attribute file displays the memory controller module name,
	version and date built.  The name of the memory controller
	hardware - some drivers work with multiple controllers and
	this field shows which hardware is present.


Total memory managed by this memory controller attribute file:

	'size_mb'
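
The csrow and rank rules described under "Memory Controller (mc) Model"
can be sketched in a few lines of Python. This is only an illustration,
not part of EDAC: the function names are hypothetical, and pairing
csrow 2k with csrow 2k+1 assumes the two-csrows-per-slot layout of the
example table, which a given controller may not follow. The demo builds
a throwaway directory tree that mimics the 'mc0' example (csrow0, csrow2,
csrow3) instead of reading a real /sys tree.

```python
import os
import re
import tempfile

CSROW_RE = re.compile(r"^csrow(\d+)$")


def populated_csrows(mc_dir):
    """Return the sorted indices of the 'csrowX' directories under mc_dir."""
    indices = []
    for name in os.listdir(mc_dir):
        match = CSROW_RE.match(name)
        if match and os.path.isdir(os.path.join(mc_dir, name)):
            indices.append(int(match.group(1)))
    return sorted(indices)


def rank_layout(csrows):
    """Classify each slot pair (csrow 2k, csrow 2k+1).

    Assumption: two csrows per slot pair, as in the example table.
    """
    present = set(csrows)
    layout = {}
    if not present:
        return layout
    for pair in range(max(present) // 2 + 1):
        even = 2 * pair in present
        odd = 2 * pair + 1 in present
        if even and odd:
            layout[pair] = "dual-ranked"
        elif even:
            layout[pair] = "single-ranked"
        elif odd:
            layout[pair] = "unexpected"  # odd csrow without its even partner
        else:
            layout[pair] = "empty"
    return layout


def build_example_tree(root):
    """Create a fake mc0 with csrow0, csrow2 and csrow3 (hypothetical helper)."""
    for name in ("csrow0", "csrow2", "csrow3"):
        os.makedirs(os.path.join(root, "mc0", name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_example_tree(root)
        rows = populated_csrows(os.path.join(root, "mc0"))
        print(rows)               # [0, 2, 3]
        print(rank_layout(rows))  # {0: 'single-ranked', 1: 'dual-ranked'}
```

On a real system the same functions could be pointed at
/sys/devices/system/edac/mc/mc0, provided an MC driver is loaded.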

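The run-time control files in directory 'mc' (panic_on_ue, log_ue, log_ce,
poll_msec) can also be driven from a script rather than with echo. The
sketch below is a minimal, hypothetical wrapper: the helper names are not
part of EDAC, it assumes each file holds a single decimal integer, and it
assumes poll_msec must be positive. Writing to the real sysfs files needs
root privileges, so the demo uses a temporary directory as a stand-in.

```python
import os
import tempfile

# Path taken from the text above; the helpers accept another root for testing.
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"
CONTROL_FILES = ("panic_on_ue", "log_ue", "log_ce", "poll_msec")


def read_control(name, root=EDAC_MC_ROOT):
    """Read one control file and return its value as an int (assumed format)."""
    if name not in CONTROL_FILES:
        raise ValueError("unknown control file: %s" % name)
    with open(os.path.join(root, name)) as handle:
        return int(handle.read().strip())


def write_control(name, value, root=EDAC_MC_ROOT):
    """Write a value to one control file, like: echo "1" > .../mc/log_ue"""
    if name not in CONTROL_FILES:
        raise ValueError("unknown control file: %s" % name)
    if name == "poll_msec":
        if value <= 0:  # assumption: the polling period must be positive
            raise ValueError("poll_msec must be a positive millisecond count")
    elif value not in (0, 1):
        raise ValueError("%s accepts only 0 or 1" % name)
    with open(os.path.join(root, name), "w") as handle:
        handle.write("%d\n" % value)


if __name__ == "__main__":
    # Stand-in directory so the demo runs without root or an EDAC driver.
    with tempfile.TemporaryDirectory() as fake_root:
        write_control("poll_msec", 1000, root=fake_root)
        write_control("log_ce", 0, root=fake_root)
        print(read_control("poll_msec", root=fake_root))  # 1000
        print(read_control("log_ce", root=fake_root))     # 0
```

Validating the value before the write keeps an out-of-range setting from
reaching the kernel; whether the kernel itself rejects such values is not
stated in this document.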