admin_guide.txt

ENFORCE_PRIME_TIME		False

Prime-time is defined as a time period each working day (Mon-Fri)
from PRIME_TIME_START through PRIME_TIME_END. Times are in 24-hour
format (i.e. 9:00AM is 9:00:00, 5:00PM is 17:00:00) with hours,
minutes, and seconds. Sites can use the prime-time scheduling policy
for the entire 24-hour period by setting PRIME_TIME_START and
PRIME_TIME_END back-to-back. The portion of a job that fits within
prime-time must be no longer than PRIME_TIME_WALLT_LIMIT (represented
in HH:MM:SS).

#PRIME_TIME_START		9:00:00
#PRIME_TIME_END			17:00:00
#PRIME_TIME_WALLT_LIMIT		1:00:00

The next option allows the site to choose an action to take upon
scheduler startup. The default is to do no special processing (NONE).
In some instances, a job can end up queued in one of the batch queues,
since it was running before but was stopped by PBS. If the argument is
RESUBMIT, these jobs will be moved back to the queue the job was
originally submitted to, and scheduled as if they had just arrived.
If the argument is RERUN, the scheduler will have PBS run any jobs
found enqueued on the execution queues. This may cause the machine to
get somewhat confused, as no limits checking is done (the assumption
being that the limits were checked when the jobs were enqueued).

SCHED_RESTART_ACTION		RESUBMIT

Define how long a job should be forced to wait in the queue before
being given "extra" priority to run. The priority given is exceeded
only by the priority of the Express queue jobs. Note that this extra
priority is ignored for jobs from an over-fairshare queue, or if the
job owner has exceeded his/her max running job limit. The value is
expressed in HH:MM:SS (default is 5 days).

MAX_WAIT_TIME			120:00:00

Define the threshold at which to choose to suspend a job rather than
checkpoint it. The specified value is the percentage of remaining time
for the job.

SUSPEND_THRESHOLD		10

Define the action to take if both suspend and checkpoint fail for a
given job. Given that the primary purpose of checkpointing is to free
up resources for a very-high-priority job, setting this to "1" (True)
tells the scheduler to force a qhold/qrerun/qrls of the job. This will
FORCE the job back into a queued state. If the user has done no
job-level checkpointing, then the job will be restarted from the
beginning.

FORCE_REQUEUE			True

If specified, this directive tells the scheduler to dump an ordered
listing of the jobs to the named file. Useful for users and for
debugging, but an expensive operation with LOTS of jobs queued, since
the file is rewritten for each run of the scheduler.

SORTED_JOB_DUMPFILE		/PBS/sched_priv/sorted_jobs

Name of the file into which to save 'fair-share' info. The file is
rewritten hourly to ensure current data is saved across scheduler
runs. Data in the file is recalculated once per day in order to
maintain historical share usage over time.

SHARE_INFO_FILE			/PBS/sched_priv/share_usage

The Fair Access Directives allow the specification, on a per-queue
basis, of a per-user limit on the maximum number of simultaneously
running jobs that a given user can have outstanding at any given
time. <access_spec> is:

FAIR_SHARE QUEUE:queuename:shares:max_running_jobs

FAIR_SHARE QUEUE:firstQ:30:8
FAIR_SHARE QUEUE:thirdQ:40:10
FAIR_SHARE QUEUE:fifthQ:30:10
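
As a worked reading of the three example lines above (this assumes the
shares field is a relative weight, normalized over all FAIR_SHARE
lines; the percentages are an illustration of that assumption, not
documented behavior):

	# Total shares = 30 + 40 + 30 = 100, so under this reading:
	#   firstQ : 30/100 = 30% share, at most  8 running jobs per user
	#   thirdQ : 40/100 = 40% share, at most 10 running jobs per user
	#   fifthQ : 30/100 = 30% share, at most 10 running jobs per user
	FAIR_SHARE QUEUE:firstQ:30:8
	FAIR_SHARE QUEUE:thirdQ:40:10
	FAIR_SHARE QUEUE:fifthQ:30:10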
* Lazy Commenting

Because changing the job comment for each of a large group of jobs can
be very time intensive, there is a notion of lazy comments. The
function that sets the comment on a job takes a flag that indicates
whether or not the comment is optional. Most of the "can't run
because ..." comments are considered to be optional.

When presented with an optional comment, the job will only be altered
if the job was enqueued after the last run of the scheduler, if it
does not already have a comment, or if the job's 'mtime' (modification
time) attribute indicates that the job has not been touched in
MIN_COMMENT_AGE seconds. This should provide each job with a comment
at least once per scheduler lifetime. It also provides an upper bound
(MIN_COMMENT_AGE seconds + the scheduling iteration) on the time
between comment updates.

This compromise seemed reasonable because the comments themselves are
somewhat arbitrary, so keeping them up-to-date is not a high priority.

That having been said, there may still be times when setting *any*
comment is too expensive an operation. Say, if you have 10,000 jobs
queued, then it may take "too long" to keep the comments updated. If
so, you can turn off commenting of jobs completely by setting the
UPDATE_COMMENTS option in the scheduler configuration file to False.

Installing The MSIC-Cluster Scheduler
-------------------------------------

The MSIC-Cluster scheduler is packaged as an alternate scheduler for
OpenPBS v2.3. Basic steps are as follows (note that $PBSSRC is the
directory into which you extracted the PBS source tree; this is the
directory that contains the configure and configure.in files, among
others; $PBSOBJ is the top of your object tree):

Rebuilding PBS to use the MSIC-Cluster scheduler
------------------------------------------------

While it is not necessary to rebuild all of PBS, it is necessary to
rerun "configure" and then build the scheduler:

	cd $PBSOBJ
	$PBSSRC/configure [your options] --set-sched-code=msic_cluster
	make
	make install

Required modifications to existing PBS configuration
----------------------------------------------------

There are several changes that will need to be made to the PBS
configuration. This version of the MSIC-Cluster scheduler can take
advantage of the server's nodes file, which contains one line per
node. (For a detailed explanation of the format of the "nodes" file,
see the PBS Admin Guide.)

1. Edit $PBSHOME/server_priv/nodes, and add one line for each
   execution host, as the following example shows:

   pbsnode1:ts  np=8
   pbsnode2:ts  np=8
   pbsnode3:ts  np=8
   pbsnode4:ts  np=8

   The first column is the hostname of the node, with ":ts" appended
   to indicate that the node is a time-shared node; the second column
   is a number-of-processors specification. (This NP value is used as
   the maximum number of CPUs on a given node, allowing the server to
   immediately reject a job that requests more CPUs than the current
   configuration can provide.)

2. Create an execution queue that will become a holding queue from
   which the scheduler will pull jobs. This queue will need certain
   minimum attributes set, as indicated below.

   First start the server, if it is not already running:

   #pbs_server

   Then create and set the queue attributes (example SUBMIT queue
   called "pending"):

   #qmgr
   create queue pending
   set queue pending queue_type = Execution
   set queue pending resources_default.ncpus = 1
   set queue pending resources_default.walltime = 00:05:00
   set queue pending enabled = True
   set queue pending started = True
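
   To confirm the queue was created as intended, you can print its
   definition back out of the server and check its status (a quick
   sanity check; the exact output format varies between PBS versions):

   qmgr -c "print queue pending"
   qstat -Q pending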
3. Create an execution queue that corresponds to each execution host,
   and set the default and maximum attributes for each execution
   queue. It is suggested you dump the qmgr output to a file, edit
   the file, and then load the changed info back into the server
   (i.e. 'qmgr -c "p s" > /tmp/somefile'; edit the file; then
   'qmgr < /tmp/somefile'). The example below shows the recommended
   attributes (and changes via qmgr).

   Note that the "from_route_only" directive prevents users from
   submitting directly to the backend execution queues. This is
   necessary since priorities are calculated based on the originating
   queues (i.e. the route queues defined below).

   #qmgr
   create queue pbsnode1
   set queue pbsnode1 queue_type = Execution
   set queue pbsnode1 from_route_only = True
   set queue pbsnode1 resources_max.ncpus = 8
   set queue pbsnode1 resources_max.walltime = 08:00:00
   set queue pbsnode1 resources_default.ncpus = 1
   set queue pbsnode1 resources_default.walltime = 00:05:00
   set queue pbsnode1 enabled = True
   set queue pbsnode1 started = True
   ...

4. Create as many "originating" Route queues as needed by your local
   configuration. These should be route queues with one destination:
   the SUBMIT queue defined above. Jobs will carry their originating
   queue name with them, and the scheduler will use this to enforce
   the queue-specific share limits. For example:

   # Create and define queue groupA
   #
   create queue groupA
   set queue groupA queue_type = Route
   set queue groupA route_destinations = pending
   set queue groupA resources_default.mem = 10mb
   set queue groupA enabled = True
   set queue groupA started = True
   #
   # Create and define queue groupB
   #
   create queue groupB
   set queue groupB queue_type = Route
   set queue groupB route_destinations = pending
   set queue groupB resources_default.mem = 20mb
   set queue groupB enabled = True
   set queue groupB started = True
   #
   # Create and define queue groupC
   #
   create queue groupC
   set queue groupC queue_type = Route
   set queue groupC route_destinations = pending
   set queue groupC resources_default.mem = 30mb
   set queue groupC enabled = True
   set queue groupC started = True
   #
   # Create and define queue express
   #
   create queue express
   set queue express queue_type = Route
   set queue express route_destinations = pending
   set queue express resources_default.mem = 40mb
   set queue express enabled = True
   set queue express started = True

   # Then set the default queue on the server. This should generally
   # point to the medium-priority route queue.
   #
   set server default_queue = groupB

5. Set the cluster-wide maximum number of CPUs (i.e. the count of all
   the CPUs on all nodes within the cluster). This info will be used
   in addition to the dynamically queried per-node CPU counts. You
   will need to update this value if you remove nodes from or add
   nodes to your cluster; do not worry about updating it for a
   temporarily unavailable node. See the sketch following this list
   for one way to derive the value.

   #qmgr
   set server resources_max.ncpus = 48	(or whatever the correct value is)
   quit
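
Since resources_max.ncpus must track the nodes file from step 1, a
one-liner like the following can derive the value (a sketch, assuming
every nodes-file line uses the np=N convention shown above; adjust the
path for your installation):

	# sum the np= processor counts across all nodes
	awk -F'np=' '/np=/ { sum += $2 } END { print sum }' $PBSHOME/server_priv/nodes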
Configuring the MSIC-Cluster scheduler
--------------------------------------

The scheduler configuration file (as discussed above) will need to be
modified. Edit $PBSHOME/sched_priv/sched_config, changing in
particular the BATCH_QUEUES line. This should contain the list of all
the queues you have defined, each with its associated execution host.
E.g., if host "hostA.pbs.com" has an associated queue named "hostA",
and "pbsnode1.pbs.com" is fed by queue "pbsnode1":

BATCH_QUEUES	hostA@hostA.pbs.com,pbsnode1@pbsnode1.pbs.com

However, the full hostname is not required, so for brevity one could
enter:

BATCH_QUEUES	hostA@hostA,pbsnode1@pbsnode1

The FAIR_SHARE directives will also need to be updated, as described
above. Review the other configuration parameters, and change any as
needed; they are currently set to recommended defaults.

General Notes
-------------

This section has some general comments about this scheduler, and
things to be aware of.

Since this scheduler supports the PBS nodes file, you can use the
"pbsnodes" command to view node status, take nodes offline, etc. Here
is a short summary of handy commands:

	pbsnodes -l      # lists all down/offline/unavailable nodes
	pbsnodes -a      # lists all info for all nodes
	pbsnodes -o node # mark the named node OFFLINE; running jobs
			 # will continue to run on that node, but no
			 # new jobs will be started on it. Be sure to
			 # list all OFFLINE nodes, as any not listed
			 # will be assumed to be up.
	pbsnodes -c node # clear or remove the OFFLINE status on node,
			 # making it available for running jobs again.
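
For example, a typical maintenance pass using the flags summarized
above might look like this (pbsnode2 is the sample host from step 1):

	# take pbsnode2 out of service (per the note above, list every
	# node that should remain OFFLINE, since unlisted nodes are
	# assumed to be up)
	pbsnodes -o pbsnode2
	# confirm it now appears among the down/offline nodes
	pbsnodes -l
	# return it to service once maintenance is complete
	pbsnodes -c pbsnode2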
