back-to-back. The portion of a job that fits within primetime must be
no longer than PRIME_TIME_WALLT_LIMIT (represented in HH:MM:SS).

    #PRIME_TIME_START       9:00:00
    #PRIME_TIME_END         17:00:00
    #PRIME_TIME_WALLT_LIMIT 1:00:00

The next option allows the site to choose an action to take upon scheduler
startup. The default is to do no special processing (NONE). In some
instances, a job can end up queued in one of the batch queues, since it
was running before but was stopped by PBS. If the argument is RESUBMIT,
these jobs will be moved back to the queue the job was originally submitted
to, and scheduled as if they had just arrived. If the argument is RERUN,
the scheduler will have PBS run any jobs found enqueued on the execution
queues. This may cause the machine to get somewhat confused, as no limits
checking is done (the assumption being that they were checked when they
were enqueued).

    SCHED_RESTART_ACTION RESUBMIT

Define how long a job should be forced to wait in the queue before being
given "extra" priority to run. The priority given is exceeded only by the
priority of the Express queue jobs. Note that this extra priority is
ignored for jobs from an over-fairshare queue, or if the job owner has
exceeded his/her max running job limit. Value is expressed in HH:MM:SS
(default is 5 days).

    MAX_WAIT_TIME 120:00:00

If specified, this directive will tell the scheduler to dump an ordered
listing of the jobs to the named file. Useful for users and debugging,
but an expensive operation with LOTS of jobs queued, since the file is
rewritten for each run of the scheduler.

    SORTED_JOB_DUMPFILE /PBS/sched_priv/sorted_jobs

The Fair Access Directives allow the specification, on a per-architecture
basis, of a per-group limit on the maximum number of CPUs and the maximum
amount of memory simultaneously used by running jobs.

Format is: FAIR_ACCESS ARCH:arch1:groupA:num_cpus:MB_of_memory

    FAIR_ACCESS ARCH:arch1:groupA:30:800
    FAIR_ACCESS ARCH:arch1:default:40:100
    FAIR_ACCESS ARCH:arch2:groupB:30:100
    FAIR_ACCESS ARCH:arch2:default:20:100

* Lazy Commenting

Because changing the job comment for each of a large group of jobs can be
very time intensive, there is a notion of lazy comments. The function that
sets the comment on a job takes a flag that indicates whether or not the
comment is optional. Most of the "can't run because ..." comments are
considered to be optional.

When presented with an optional comment, the job will only be altered if
the job was enqueued after the last run of the scheduler, if it does not
already have a comment, or if the job's 'mtime' (modification time)
attribute indicates that the job has not been touched in MIN_COMMENT_AGE
seconds. This should provide each job with a comment at least once per
scheduler lifetime. It also provides an upper bound (MIN_COMMENT_AGE
seconds + the scheduling iteration) on the time between comment updates.

This compromise seemed reasonable because the comments themselves are
somewhat arbitrary, so keeping them up-to-date is not a high priority.
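If your build exposes MIN_COMMENT_AGE as a directive in sched_config (an
assumption; it may instead be a compile-time constant), an entry limiting
comment refreshes to roughly once every two minutes might look like:

    MIN_COMMENT_AGE 120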
Installing The UMN-Cluster Scheduler
------------------------------------

The UMN-Cluster scheduler is packaged as an optional scheduler for OpenPBS
v.2.3. Basic steps are as follows (note that $PBSSRC is the directory into
which you extracted the PBS source tree; this is the directory that contains
the configure and configure.in files, among others); $PBSOBJ is the top of
your object tree.

Rebuilding PBS to use UMN-Cluster scheduler
-------------------------------------------

While it is not necessary to rebuild all of PBS, it is necessary to rerun
"configure" and then build the scheduler:

    cd $PBSOBJ
    $PBSSRC/configure [your options] --set-sched-code=umn_cluster
    make
    make install

Note: To run this scheduler on systems that have more than 2 GB of memory,
you should build the scheduler in 64-bit mode, using the configure options
specific to your compiler.
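The exact 64-bit options depend on your compiler; as a sketch, with a
gcc-style compiler the switch is typically -m64 (an assumption on our
part, not something this guide specifies), passed to configure through
the environment:

    cd $PBSOBJ
    env CC=gcc CFLAGS=-m64 $PBSSRC/configure [your options] --set-sched-code=umn_cluster
    make
    make install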
Required modifications to existing PBS configuration
----------------------------------------------------

There are several changes that will need to be made to the PBS configuration.
The UMN-Cluster scheduler takes advantage of the server nodes file, which
contains one line per node. (For a detailed explanation of the format of
the "nodes" file, see the PBS Administrator Guide.)

1. Edit $PBSHOME/server_priv/nodes, and add one line for each execution
   host, as the following example shows:

       pbsnode1:ts np=8
       pbsnode2:ts np=8
       pbsnode3:ts np=8
       pbsnode4:ts np=8

   where the first column is the hostname of the node, appended with a
   ":ts" denoting that it is a timeshared node, and the second column is
   a number-of-processors specification. (This NP value is used as the
   maximum number of cpus on a given node. This allows the server to
   immediately reject a job that requests more cpus than can be provided
   by the current configuration.)

2. Create an execution queue that will become a holding queue from which
   the scheduler will pull jobs. This queue will need certain minimum
   attributes set, as indicated below.

   First start the server, if not already running:

       #pbs_server

   then create and set the queue attributes (example SUBMIT queue called
   "pending"):

       #qmgr
       create queue pending
       set queue pending queue_type = Execution
       set queue pending resources_default.ncpus = 1
       set queue pending resources_default.walltime = 00:05:00
       set queue pending enabled = True
       set queue pending started = True

3. Create an execution queue that corresponds to each execution host.
   Set the default and maximum attributes for each execution queue. It
   is suggested you dump the qmgr output to a file, and then edit the
   file (i.e. 'qmgr -c "p s" > /tmp/somefile'; edit the file; and then
   load the changed info back into the server: 'qmgr < /tmp/somefile').
   The example below shows the recommended attributes (and changes via
   qmgr). Note that the "from_route_only" directive prevents users from
   submitting directly to the backend execution queues. This is necessary
   since priorities are calculated based on the originating queues (i.e.
   the route queues defined below).

       #qmgr
       create queue pbsnode1
       set queue pbsnode1 queue_type = Execution
       set queue pbsnode1 from_route_only = True
       set queue pbsnode1 resources_max.ncpus = 8
       set queue pbsnode1 resources_max.walltime = 08:00:00
       set queue pbsnode1 resources_max.memory = 240mb
       set queue pbsnode1 resources_default.ncpus = 1
       set queue pbsnode1 resources_default.walltime = 00:05:00
       set queue pbsnode1 enabled = True
       set queue pbsnode1 started = True
       ...

4. Create as many "originating" Route queues as needed by your local
   configuration. These should be route queues with one destination:
   the above defined SUBMIT queue. Jobs will carry their originating
   queue name with them.

       #
       # Create and define queue groupA
       #
       create queue groupA
       set queue groupA queue_type = Route
       set queue groupA route_destinations = pending
       set queue groupA resources_default.mem = 10mb
       set queue groupA enabled = True
       set queue groupA started = True
       #
       # Create and define queue groupB
       #
       create queue groupB
       set queue groupB queue_type = Route
       set queue groupB route_destinations = pending
       set queue groupB resources_default.mem = 20mb
       set queue groupB enabled = True
       set queue groupB started = True
       #
       # and then set the default queue on the server.
       #
       set server default_queue = pending

5. Set the cluster-wide maximum number of cpus (e.g. the count of all
   the CPUs on all nodes within the cluster). This info will be used in
   addition to the dynamically queried per-node CPU counts. You will
   need to update this value if you remove or add nodes to your cluster.
   Do not worry about updating it for a temporarily unavailable node.

       #qmgr
       set server resources_max.ncpus = 48    (or whatever the correct value is)
       quit

Configuring the UMN-Cluster scheduler
-------------------------------------

The scheduler configuration file (as discussed above) will need to be
modified. Edit $PBSHOME/sched_priv/sched_config, changing in particular
the BATCH_QUEUES line. This should contain the list of all the queues
you have defined, and the associated execution host, e.g.:

Host "hostA.pbs.com" has an associated queue named "hostA" and
"pbsnode1.pbs.com" is fed by queue "pbsnode1":

    BATCH_QUEUES hostA@hostA.pbs.com,pbsnode1@pbsnode1.pbs.com

However, the full hostname is not required, so for brevity, one could enter:

    BATCH_QUEUES hostA@hostA,pbsnode1@pbsnode1

The FAIR_ACCESS directive will also need to be updated, as described above.
Review the other configuration parameters, and change any as needed.
They are currently set to recommended defaults.

General Notes
-------------

This section has some general comments about this scheduler, and things
to be aware of.

Since this scheduler supports the PBS nodes file, you can use the "pbsnodes"
commands to view node status, take nodes offline, etc. Here is a short
summary of handy pbsnodes commands:

    pbsnodes -l        # lists all down/offline/unavailable nodes
    pbsnodes -a        # lists all info for all nodes
    pbsnodes -o node   # mark the named node OFFLINE; running jobs will
                       # continue to run on that node, but no new jobs
                       # will be started on it. Be sure to list all
                       # OFFLINE nodes as any not listed will be assumed
                       # to be up.
    pbsnodes -c node   # clear or remove the OFFLINE status on node,
                       # making it available for running jobs again.
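For example, to take two of the nodes from the earlier examples offline
in a single call (assuming your pbsnodes accepts a list of node names,
as OpenPBS versions generally do):

    pbsnodes -o pbsnode2 pbsnode3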