
📄 reconf.notes

From: Scott Smyth <SSmyth@Connex.com>
To: Dale Stephenson <Steph@Connex.com>, Daniel Cox <DCox@Connex.com>
Cc: Anne Crosby <ACrosby@Connex.com>, Gus Andrade <GAndrade@connex.com>,
    Rick Saylor <RSaylor@Connex.com>
Subject: FW: raidreconf
Date: Sat, 16 Dec 2000 12:55:15 -0800

 <<raidtools-0.90-raidreconf.tar.gz>>

Hi Dale and Danny;

As you know the constant explanation of why we cannot do
OCE now is a never ending battle.  I would like to put
both of you (SJ/Atlanta team) on the task of taking Jakob's
code and making it work as he (and we) desire.  He has
done a lot of work, but I am sure there is still more to
do here.  I know stuff with LVM will continue to come up
and need fixing (along with other stuff of course), but
I would like to give this to you and Danny as I believe you
two are the best suited for this.

I would also like to see how a project works out between
the two sites.
It also is relatively stand-alone to start
with and means it can be done via laptop.

I am attaching what Jakob sent me, but I am going to check
it into the CVS server, so please retrieve it from there
and develop from there.  Danny, you will want to put
it into your CVS server there as well.

If either of you does not want to work on this, please tell
me, and we will see what we can do.

Rick, do you see any problem with Danny working with Dale
on this?

thanks,
Scott

-----Original Message-----
From: Jakob Østergaard
To: Wisor, Eugene; Scott Smyth
Sent: 12/16/00 5:55 AM
Subject: raidreconf

Hi Eugene and Scott  !

You are the two people who inquired about raidreconf, and who expressed
interest in the code that I had.   I suggest the two of you work out
how to best attack this beast.

Ok, well here it is   :)

The code that I'm sending you has been modified from the last version
that I tested heavily, so the argument parsing etc. is much nicer.  However,
this was done on a Solaris box, so I couldn't test whether the tool still
worked after that :*(   If you really have no luck at all with it, I can
send you the last version that I tested heavily, that has known bugs and
ugly argument handling, but is at least known to work in some cases  :)

There's a file called arytest.c;  this was an idea I had that it would
be nice to have a testing facility so we could test the tool without running
as root and depending on the kernel to do all the RAID magic.  However,
that means we need to re-implement all the raid-levels - I don't know what
I was thinking when I started that one, but I didn't get any of the levels
done...   There's some ok code in it, so I didn't have the heart to remove
the file - use and abuse as you see fit.  I think we just want to forget
about it.

raidreconf:  The tool has been almost completely rewritten from the
version (0.1) available at my homepage.  Main change:

*)  The old version implemented an algorithm to specifically allow growing of
RAID0 arrays.
And another algorithm to allow shrinking of RAID0 arrays.
If we want to add RAID5, we need to implement specific algorithms for this as
well.
And for conversion, we need to implement a specific algorithm for each level to
each level, meaning 6^2 = 36 algorithms in total.  That sucks.  And the
possibility of errors lurking in a lesser-tested algorithm
somewhere is something we don't want to think about.

*)  The new version implements a common buffering system, that will request
level-drivers to read blocks. The buffering system will then flush blocks to
the disks again, where it sees that the block has been read from the old array
configuration, and just needs to be written for the new configuration.

The buffering system is there for performance reasons - it is simple to do
conversion correctly, but it is hard to do it correctly *and*
efficiently.  The buffer will allow the level specific drivers to read a number
of consecutive blocks from the disks, even though (depending on the level) the
actual data we're moving may not lie in consecutive blocks on the individual
disks.

Now all we need to do is to make a driver for each RAID level, then we have
array growth on any level, and any-level <-> any-level conversion.

Currently I have implemented RAID0 and single-disk drivers.  RAID0 is by far
the most complicated driver of them all (RAID5 will be simple in comparison),
because of the possibility of actually completely utilizing an array of disks
of non-equal sizes.   And I suspect that there might be a bug in the RAID0
driver still - because some cases of RAID0 conversion fail.  But it's hairy
code...

The buffer subsystem works with the concept of "wishes" and "fulfilled wishes".
A reconfiguration will start with the main loop (see reconfiguration.c) asking
the "source" driver to read a number of blocks.
What will happen is that the
source driver will place "wishes" in a queue - the driver does not do any I/O
by itself (it's a "logic" driver, so to speak, not a "device" driver).  The
loop will then ask the buffering subsystem to fulfill the wishes - it does not
have to fulfill all wishes, but it has some heuristics as to how many it should
fulfill, depending on how the wishes are ordered on the physical disks.  How
many should be fulfilled is not a correctness issue at all, it's a performance
issue only.

Once a block has been read from the source configuration (old configuration),
we say that the block is "free".  Meaning, that local block on that local
disk can now safely be overwritten, as it has been read to the buffer (or even
written to somewhere in the new configuration).  HOWEVER: the ideal order in
which to free blocks is rarely the same (global block) order as the ideal order
in which to flush the blocks back to disk (in the new configuration).  If
we free one block, then we can often not just flush it to disk, because that
would mean writing to some local block on some local disk where we have not
yet freed that block !   Then we will have to free that block first - but we
can maybe only free it and not flush it, because that again depends on another
block that must be freed...  Get the picture ?   That's why we have the buffer
subsystem too.   A buffer large enough to hold one block would allow us to
recursively follow this free/flush scheme, but it would be horrendously
inefficient - so we have a large buffer.  Eventually we will be able to
flush a number of blocks in an efficient way - however, if we are not, things
will still work correctly (once the bugs are corrected), it will just be
inefficient - but this is unlikely to happen at any large scale I think.
See free_block_and_friends() in rrc_common.c for the implementation.

Note that the level drivers implement the wishing routine (request_blocks),
as well as mapping from disk-local to array-global blocks.
This mapping
must be absolutely 100% embarrassingly correct - and I think that's where
the problem is with RAID0.  To fix it:  read raid0.c in the kernel, read
it again, then again, and then look at the rrc_raid0.c code   :)

Actually, looking over the code again:  I do think that I have shrinking
working to some extent...

To do:

*) Test with a number of raid0<->raid0 reconfigurations, with disks of varying
sizes...  (tip: use loopback mounted files as disks!)  I'm sure there is a bug
there somewhere.   I wrote some small utilities to fill a disk/array with a
bit-pattern and read it back again - so that I could identify when blocks were
moved to the wrong places - but I think I lost them. I've actually lost them
before...  But they're easy to write (I've done that twice too of course)   ;)

*) Implement rrc_raid1.c, rrc_raid4.c and rrc_raid5.c.  Don't worry about
parity calculation - if the superblock is written as "unclean" (which it is),
then the kernel will automagically reconstruct the array - home clean.

*) When everything works, then re-implement the request-merging (the wishlist
handling).  The algorithms I wrote are simple and are intended to be correct,
but I did not spend time writing *efficient* algorithms.  There are obvious
optimizations that could be applied.    Correctness first, optimization later.

*) I think that's pretty much it.  However - it is important that you
understand how the buffer stuff and the drivers work together...  Screwing up
in where you work with global blocks and where you work with local blocks, or
where you are mapping with the source driver and where you're mapping with the
sink driver, will give you headaches and "artistically reconfigured" arrays...
90% of the bugs I made in the code were due to this. The remaining 10% of the
bugs are still in there   ;)

*) oh.  I almost forgot...  Once the tool works, then move it to the kernel to
allow on-line transparent reconfiguration.
That will go nicely hand in hand
with the on-line transparent ext2 resizing  -  now that's going for the gold !

Tips: Use assert().  fprintf() on all error-returns and abnormal-returns.
Better to make an extra sanity check than to count on something always being
the case. If this code is 99% correct, it is still unusable - a core-dump is
better than silent corruption.   etc. etc. etc.   (shit I sound like I'm old!)      ;)

Feel free to e-mail me with questions - I'd love to help out with that - I'm
sorry that I haven't had the time to finish this tool myself.  I think that the
code is *fairly* well commented, with explanations of some of the logic
(rrc_common.h for example).  However, there may be an out of date comment
somewhere...

And again, if the code I'm sending you now is completely unable to perform even
the simplest reconfigurations, let me know, and I will send you the ugly-
args-parsing version that I know is capable of quite some reconfigurations.

Gentlemen,  I wish you the best of luck, much joy, and a merry christmas !   :)

Cheers,
--
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:
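
The pattern-fill test from the to-do list, together with the local/global
block mapping that the level drivers must get right, can be sketched roughly
as below. This is an illustration in Python, not code from raidreconf (which
is written in C). The names raid0_layout, fill_pattern, verify_pattern and
reconfigure are hypothetical, the in-memory lists only stand in for disks,
and the striping rule is a simplified assumption (one block per chunk,
striping over every disk that still has room), not a transcription of the
kernel's raid0.c or of rrc_raid0.c.

```python
"""Illustrative only: in-memory 'disks' and a simplified RAID0-style layout."""


def raid0_layout(disk_sizes):
    """Map each global block to (disk, local block) for disks of unequal size.

    Assumed scheme: at each local offset, stripe across every disk that
    still has a block at that offset. Once the smaller disks are full,
    striping continues over the remaining larger disks, so all capacity
    is used. The list index is the global block number.
    """
    layout = []
    for local in range(max(disk_sizes, default=0)):
        for disk, size in enumerate(disk_sizes):
            if local < size:
                layout.append((disk, local))
    return layout


def fill_pattern(disk_sizes):
    """Create in-memory disks where each block stores its own global number."""
    disks = [[None] * size for size in disk_sizes]
    for global_block, (disk, local) in enumerate(raid0_layout(disk_sizes)):
        disks[disk][local] = global_block
    return disks


def verify_pattern(disks, disk_sizes, count):
    """Return the global blocks (below count) that are not where they belong."""
    layout = raid0_layout(disk_sizes)
    return [g for g, (disk, local) in enumerate(layout[:count])
            if disks[disk][local] != g]


def reconfigure(old_disks, old_sizes, new_sizes):
    """Reference reconfiguration: read via the source layout, write via the sink.

    This buffers the whole array at once and writes to a fresh set of disks.
    That is trivially correct, but it sidesteps the in-place free/flush
    ordering problem that the real buffer subsystem has to solve.
    """
    source = raid0_layout(old_sizes)
    sink = raid0_layout(new_sizes)
    buffer = [old_disks[disk][local] for disk, local in source]
    new_disks = [[None] * size for size in new_sizes]
    # zip() stops at the shorter layout, so shrinking keeps the leading blocks.
    for data, (disk, local) in zip(buffer, sink):
        new_disks[disk][local] = data
    return new_disks


if __name__ == "__main__":
    old_sizes, new_sizes = [4, 4], [4, 4, 6]
    disks = fill_pattern(old_sizes)
    grown = reconfigure(disks, old_sizes, new_sizes)
    misplaced = verify_pattern(grown, new_sizes, count=sum(old_sizes))
    print("misplaced blocks:", misplaced)
```

Because every block carries its own global number, a wrong source or sink
mapping shows up as a list of the exact blocks that landed in the wrong
place, rather than as silent corruption. Against the real tool the same
check would be run on loopback-mounted files instead of Python lists.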
