  --------------------------------------------------------------------

  A common typo in prose texts is doubled words (hopefully they
  have been edited out of this book except in those few cases
  where they are intended).  The same error occurs to a lesser
  extent in programming language code, configuration files, or
  data feeds.  Regular expressions are well-suited to detecting
  this occurrence, which just amounts to a backreference to a
  word pattern.  It's easy to wrap the regex in a small utility
  with a few extra features:

      #---------- dupwords.py ----------#
      # Detect doubled words and display with context
      # Include words doubled across lines but within paras

      import sys, re, glob
      for pat in sys.argv[1:]:
          for file in glob.glob(pat):
              newfile = 1
              for para in open(file).read().split('\n\n'):
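                  # (?m) lets ^ and $ match at each physical line; \2 is a
                  # backreference to the captured word, and \s* lets the
                  # doubled word span a newline within the paragraph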
                  dups = re.findall(r'(?m)(^.*(\b\w+\b)\s*\b\2\b.*$)', para)
                  if dups:
                      if newfile:
                          print '%s\n%s\n' % ('-'*70,file)
                          newfile = 0
                      for dup in dups:
                          print '[%s] -->' % dup[1], dup[0]

  This particular version grabs the line or lines on which
  duplicates occur and prints them for context (along with a prompt
  for the duplicate itself). Variations are straightforward. The
  assumption made by 'dupwords.py' is that a doubled word that
  spans a line (from the end of one to the beginning of another,
  ignoring whitespace) is a real doubling; but a duplicate that
  spans paragraphs is not likewise noteworthy.
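
  A hypothetical run over one or more filename patterns might look
  like the below (the file name and the flagged line are invented
  for illustration):

      #*------ Using dupwords.py (hypothetical run) -----#
      % python dupwords.py '*.txt'
      ----------------------------------------------------------------------
      mydocument.txt

      [the] --> This is the the sort of doubling it reports.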


  PROBLEM: Checking for server errors
  --------------------------------------------------------------------

  Web servers are a ubiquitous source of information nowadays.
  But finding URLs that lead to real documents is largely
  hit-or-miss.  Every Web maintainer seems to reorganize her site
  every month or two, thereby breaking bookmarks and hyperlinks.
  As bad as the chaos is for plain Web surfers, it is worse for
  robots faced with the difficult task of recognizing the
  difference between content and errors.  As a result, it is easy
  to accumulate downloaded Web pages that consist of error
  messages rather than desired content.

  In principle, Web servers can and should return error codes
  indicating server errors.  But in practice, Web servers almost
  always return dynamically generated results pages for erroneous
  requests.  Such pages are basically perfectly normal HTML pages
  that just happen to contain text like "Error 404:  File not
  found!"  Most of the time these pages are a bit fancier than
  this, containing custom graphics and layout, links to site
  homepages, JavaScript code, cookies, meta tags, and all sorts
  of other stuff.  It is actually quite amazing just how much
  many Web servers send in response to requests for nonexistent
  URLs.

  Below is a very simple Python script to examine just what Web
  servers return on valid or invalid requests.  Getting an error
  page is usually as simple as asking for a page called
  'http://somewebsite.com/phony-url' or the like (anything that
  doesn't really exist).  [urllib] is discussed in Chapter 5, but
  its details are not important here.

      #---------- url_examine.py ----------#
      import sys
      from urllib import urlopen

      if len(sys.argv) > 1:
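          # Print the resolved URL, the response headers, then the page body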
          fpin = urlopen(sys.argv[1])
          print fpin.geturl()
          print fpin.info()
          print fpin.read()
      else:
          print "No specified URL"

  Given the diversity of error pages you might receive, it is
  difficult or impossible to create a regular expression (or any
  program) that determines with certainty whether a given HTML
  document is an error page.  Furthermore, some sites choose to
  generate pages that are not really quite errors, but not
  really quite content either (e.g., generic directories of site
  information with suggestions on how to get to content).  But
  some heuristics come quite close to separating content from
  errors.  One noteworthy heuristic is that the interesting
  errors are almost always 404 or 403 (not a sure thing, but good
  enough to make smart guesses).  Below is a utility to rate the
  "error probability" of HTML documents:

      #---------- error_page.py ----------#
      import re, sys
      page = sys.stdin.read()

      # Mapping from patterns to probability contribution of pattern
      err_pats = {r'(?is)<TITLE>.*?(404|403).*?ERROR.*?</TITLE>': 0.95,
                  r'(?is)<TITLE>.*?ERROR.*?(404|403).*?</TITLE>': 0.95,
                  r'(?is)<TITLE>ERROR</TITLE>': 0.30,
                  r'(?is)<TITLE>.*?ERROR.*?</TITLE>': 0.10,
                  r'(?is)<META .*?(404|403).*?ERROR.*?>': 0.80,
                  r'(?is)<META .*?ERROR.*?(404|403).*?>': 0.80,
                  r'(?is)<TITLE>.*?File Not Found.*?</TITLE>': 0.80,
                  r'(?is)<TITLE>.*?Not Found.*?</TITLE>': 0.40,
                  r'(?is)<BODY.*(404|403).*</BODY>': 0.10,
                  r'(?is)<H1>.*?(404|403).*?</H1>': 0.15,
                  r'(?is)<BODY.*not found.*</BODY>': 0.10,
                  r'(?is)<H1>.*?not found.*?</H1>': 0.15,
                  r'(?is)<BODY.*the requested URL.*</BODY>': 0.10,
                  r'(?is)<BODY.*the page you requested.*</BODY>': 0.10,
                  r'(?is)<BODY.*page.{1,50}unavailable.*</BODY>': 0.10,
                  r'(?is)<BODY.*request.{1,50}unavailable.*</BODY>': 0.10,
                  r'(?i)does not exist': 0.10,
                 }
      err_score = 0
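      # Sum the weights of every matching pattern, stopping early
      # once the score is already conclusive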
      for pat, prob in err_pats.items():
          if err_score > 0.9: break
          if re.search(pat, page):
              # print pat, prob
              err_score += prob

      if err_score > 0.90:   print 'Page is almost surely an error report'
      elif err_score > 0.75: print 'It is highly likely page is an error report'
      elif err_score > 0.50: print 'Better-than-even odds page is error report'
      elif err_score > 0.25: print 'Fair indication page is an error report'
      else:                 print 'Page is probably real content'

  Tested against a fair number of sites, a collection of regular
  expression searches and threshold confidences like this works
  quite well.  Within the author's own judgment of just what is
  really an error page, 'error_page.py' has produced no false
  positives and has always reached at least the lowest warning
  level for every true error page.

  The patterns chosen are all fairly simple, and both the
  patterns and their weightings were determined entirely
  subjectively by the author.  But something like this weighted
  hit-or-miss technique can be used to solve many "fuzzy logic"
  matching problems (most having nothing to do with Web server
  errors).
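
  As a minimal sketch of the general technique, divorced from the
  Web-error specifics (the function name and the toy spam patterns
  below are invented for illustration):

      #*------ Weighted pattern scoring, generically -----#
      import re

      def weighted_score(text, weighted_pats, conclusive=0.9):
          # Sum the weights of every pattern that matches 'text',
          # stopping early once the score is already conclusive
          score = 0.0
          for pat, weight in weighted_pats.items():
              if score > conclusive: break
              if re.search(pat, text):
                  score += weight
          return score

      spam_pats = {r'(?i)free money':  0.80,
                   r'(?i)act now':     0.30,
                   r'(?i)unsubscribe': 0.20}
      print weighted_score('ACT NOW for FREE MONEY!', spam_pats)

  The choice of patterns, weights, and decision thresholds is the
  domain-specific part; the scoring loop itself stays the same.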

  Code like that above can form a general approach to more
  complete applications.  But for what it is worth, the scripts
  'url_examine.py' and 'error_page.py' may be used directly
  together by piping from the first to the second.  For example:

      #*------ Using url_examine.py with error_page.py -----#
      % python url_examine.py http://gnosis.cx/nonesuch | python error_page.py
      Page is almost surely an error report


  PROBLEM: Reading lines with continuation characters
  --------------------------------------------------------------------

  Many configuration files and other types of computer code are
  line oriented, but also have a facility to treat multiple lines
  as if they were a single logical line.  In processing such a
  file it is usually desirable as a first step to turn all these
  logical lines into actual newline-delimited lines (or more
  likely, to transform both single and continued lines as
  homogeneous list elements to iterate through later).  A
  continuation character is generally required to be the -last-
  thing on a line before a newline, or possibly the last thing
  other than some whitespace.  A small (and very partial) table
  of continuation characters used by some common and uncommon
  formats is listed below:

      #*----- Common continuation characters -----#
      \  Python, JavaScript, C/C++, Bash, TCL, Unix config
      _  Visual Basic, PAW
      &  Lyris, COBOL, IBIS
      ;  Clipper, TOP
      -  XSPEC, NetREXX
      =  Oracle Express

  Most of the formats listed are programming languages, and
  parsing them takes quite a bit more than just identifying the
  lines.  More often, it is configuration files of various sorts
  that are of interest in simple parsing, and most of the time
  these files use a common Unix-style convention of using
  trailing backslashes for continuation lines.
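
  For instance, a made-up configuration fragment like the one
  below occupies two physical lines but only one logical line:

      #*------ A continued line (invented example) -----#
      command = tail --lines=100 \
                /var/log/messages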

  One -could- manage to parse logical lines with a [string]
  module approach that looped through lines and performed
  concatenations when needed.  But a greater elegance is served
  by reducing the problem to a single regular expression.  A
  rough sketch of the loop style is below, followed by the
  regular expression module:
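
      #*------ Loop-based alternative (illustrative sketch) -----#
      # Not the approach taken below; just the line-by-line loop the
      # text alludes to.  Assumes the continuation marker is the very
      # last character on a physical line.
      def logical_lines_by_loop(s, continuation='\\'):
          logical, partial = [], ''
          for line in s.splitlines():
              if line.endswith(continuation):
                  partial += line[:-len(continuation)]   # drop marker, keep going
              else:
                  logical.append(partial + line)
                  partial = ''
          if partial:             # a dangling continuation at end of input
              logical.append(partial)
          return logical

  The single-regular-expression version is the module below: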

      #---------- logical_lines.py ----------#
      # Determine the logical lines in a file that might have
      # continuation characters.  'logical_lines()' returns a
      # list.  The self-test prints the logical lines as
      # physical lines (for all specified files and options).

      import re

      def logical_lines(s, continuation='\\', strip_trailing_space=0):
          c = continuation
          if strip_trailing_space:
              s = re.sub(r'(?m)(%s)(\s+)$' % re.escape(c), r'\1', s)
          pat_log = r'(?sm)^.*?$(?<!%s)' % re.escape(c)  # e.g. (?sm)^.*?$(?<!\\)
          return [t.replace(c+'\n','') for t in re.findall(pat_log, s)]

      if __name__ == '__main__':
          import sys
          files, strip, contin = ([], 0, '\\')
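          # Crude option handling:  '-cX' or '--continue=X' select the
          # continuation character; '-s' or '--strip' strips trailing spaces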
          for arg in sys.argv[1:]:
              if arg[:-1] == '--continue=': contin = arg[-1]
              elif arg[:-1] == '-c': contin = arg[-1]
              elif arg in ('--strip','-s'): strip = 1
              else: files.append(arg)
          if not files: files.append(sys.stdin)
          for file in files:
              # read from stdin if no filenames were given
              if hasattr(file, 'read'): s = file.read()
              else: s = open(file).read()
              print '\n'.join(logical_lines(s, contin, strip))

  The comment in the 'pat_log' definition shows a bit just how
  cryptic regular expressions can be at times.  The comment is
  the pattern that is used for the default value of
  'continuation'.  But as dense as it is with symbols, you can
  still read it by proceeding slowly, left to right.  Let us try
  a version of the same line with the verbose modifier and
  comments:

      >>> pat = r'''
      ... (?x)    # This is the verbose version
      ... (?s)    # In the pattern, let "." match newlines, if needed
      ... (?m)    # Allow ^ and $ to match every begin- and end-of-line
      ... ^       # Start the match at the beginning of a line
      ... .*?     # Non-greedily grab everything until the first place
      ...         # where the rest of the pattern matches (if possible)
      ... $       # End the match at an end-of-line
      ... (?<!    # Only count as a match if the enclosed pattern was not
      ...         # the immediately last thing seen (negative lookbehind)
      ... \\)     # It wasn't an (escaped) backslash'''
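
  To check that the verbose pattern behaves the same way as
  'pat_log', a quick interactive test helps (the sample string is
  invented):

      >>> import re
      >>> s = 'real line one \\\nis continued\nreal line two'
      >>> for line in re.findall(pat, s):
      ...     print line.replace('\\\n', '')
      ...
      real line one is continued
      real line two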


  PROBLEM: Identifying URLs and email addresses in texts
  --------------------------------------------------------------------

  A neat feature of many Internet and news clients is their
  automatic identification of resources that the applications can
  act upon. For URL resources, this usually means making the links
  "clickable"; for an email address it usually means launching a
  new letter to the person at the address. Depending on the nature
  of an application, you could perform other sorts of actions for
  each identified resource. For a text processing application, the
  use of a resource is likely to be something more batch-oriented:
  extraction, transformation, indexing, or the like.

  Fully and precisely implementing RFC822 (for email addresses)
  or RFC1738 (for URLs) is possible within regular expressions.
  But doing so is probably even more work than is really needed
  to identify 99% of resources.  Moreover, a significant number
  of resources in the "real world" are not strictly compliant
  with the relevant RFCs--most applications give a certain leeway
  to "almost correct" resource identifiers.  The utility below
  tries to strike approximately the same balance as other
  well-implemented and practical applications:  get -almost-
  everything that was intended to look like a resource, and
  -almost- nothing that was intended not to:

      #---------- find_urls.py ----------#
      # Functions to identify and extract URLs and email addresses

      import re, fileinput

      pat_url = re.compile(  r'''
                       (?x)( # verbose identify URLs within text
           (http|ftp|gopher) # make sure we find a resource type
                         :// # ...needs to be followed by colon-slash-slash
              (\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
                        (/?| # could be just the domain name (maybe w/ slash)
                  [^ \n\r"]+ # or stuff then space, newline, tab, quote
                      [\w/]) # resource name ends in alphanumeric or slash
           (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                           ) # end of match group
                             ''')
      pat_email = re.compile(r'''
                      (?xm)  # verbose identify email addresses (multiline)
                   (?=^.{11} # Mail header matcher
           (?<!Message-ID:|  # rule out Message-ID's as best possible
               In-Reply-To)) # ...and also In-Reply-To
                      (.*?)( # must grab to email to allow prior lookbehind
          ([A-Za-z0-9-]+\.)? # maybe an initial part: DAVID.mertz@gnosis.cx
               [A-Za-z0-9-]+ # definitely some local user: MERTZ@gnosis.cx
                           @ # ...needs an at sign in the middle
                (\w+\.?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
           (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                           ) # end of match group
                             ''')
