/* readme.txt describing the model of object recognition at CBCL. kouh@mit.edu, September 2006. */

1. OVERVIEW

This package of code is an implementation of a model of rapid object recognition in the primate ventral pathway. The original implementation of the model, formerly known as HMAX, was described in:

Riesenhuber, M. and Poggio, T. Hierarchical Models of Object Recognition in Cortex. Nature Neuroscience 2, 1019-1025 (1999).

In terms of the overall architecture, the model is hierarchical and feedforward, reflecting (1) the hierarchical organization of the ventral pathway (e.g. the van Essen diagram) and (2) the ultra-rapid recognition performance that leaves little time for feedback or recurrent processing (e.g. Thorpe et al.).

In terms of computation, the model employs operations for specificity and invariance, the two requirements of object recognition, which are performed in an interleaved fashion throughout the hierarchy.

2. COMPONENTS

There are three main components in the model: (1) the neurons in the hierarchy, (2) the synaptic weights between neurons, and (3) the nonlinear operations performed by the local circuitry of these neurons. Given the hierarchical and feedforward nature of the model, these three components fully determine the output of the model for any stimulus, as there is no probabilistic element or noise in the current model.

(1) Neurons in the hierarchy

The responses of the neurons in the hierarchy are stored in a structure of 3-D matrices. For example, the structure r has fields r.r1, r.r2, ..., r.r5, reflecting the five hierarchical levels in the model (S1, C1, S2, C2b, and VTU). Each field is a 3-D matrix whose first two dimensions correspond to the spatial (x-y) location of the neuron. The third dimension indexes the neuron's type of selectivity. For example, the C1 neurons form 2-D maps with different orientation selectivities. Given the topographic organization of neurons over 2-D images, together with their different selectivities, at least three dimensions are needed to represent a population of neurons. Note that at the highest level (such as VTU), where the neurons cover the entire visual field and have no topographic organization, we keep the same 3-D format with a 1 x 1 x N matrix. In short, r.rn is x by y by number of features, where n is the layer index.

(2) Synaptic weights or filters

The connectivities between the afferent and the efferent neurons are specified in 4-D matrices. They can be thought of as synaptic weights, or as filters for performing convolution operations. The first two dimensions of a filter matrix span the spatial (x-y) extent of the afferent neurons, and the third dimension indexes the different selectivities of the afferents. Since there can be many different connectivities, they are stacked along the fourth dimension. In a sense, the responses of the efferent neurons are determined by 3-D convolutions between a filter and the responses of the afferent neurons; in Matlab notation, r.r(n+1) = convn(r.r(n), filt).
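As a minimal sketch of this data layout (all sizes below are hypothetical, chosen for illustration only):

    % Hypothetical sizes, for illustration only.
    r.r1 = zeros(128, 128, 4);   % S1-like layer: 128 x 128 locations, 4 types
    filt = zeros(3, 3, 4, 16);   % 3 x 3 afferent grid, 4 afferent types,
                                 % 16 efferent types stacked along dim 4
    % After the convolution-like step, the new layer has one 2-D map per
    % efferent type: size(r.r2, 3) equals size(filt, 4).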
We require that the third dimension of the filter and that of the previous layer, r.r(n), are the same size; as a result of the convolution, the third dimension of the new layer, r.r(n+1), is the same size as the fourth dimension of the filter. If we consider synaptic convergence to be how information/signals are relayed from one layer to the next in the hierarchy of neurons, this 4-D representation of the synaptic weights is the natural convention, given the 3-D representation of the neural responses.

For each set of afferents, we consider two different connectivities (f1 and f2), where f1 specifies the direct synaptic weights to the efferent neuron and f2 specifies the indirect synaptic weights (potentially through a shunting-inhibiting pool cell). Furthermore, the extent of the afferent connectivity may vary (i.e. the first two (x-y) dimensions of these 4-D matrices may differ in size), corresponding to different receptive field sizes of the efferent neurons: some neurons may receive synaptic connections from a 3x3 grid of afferents, others from a 5x5 grid, for example. Such differences in receptive field size are organized into "scales" (from small to large) and denoted with scale tags of the form "_s?" (where ? is an integer scale index).

These 4-D matrices can be quite sparse. For example, there may be a thousand different types of selectivity (third dimension), yet an efferent neuron may make only 10-20 synaptic connections with its afferents. Hence, in this implementation, the 4-D filters are stored in a sparse format, so that the weighted-sum operations are performed over the nonzero weights only, saving both processing time and memory. This packaging of 4-D matrices is done by the filt_package.m function (see its documentation), which produces the following fields from the two 4-D matrices f1 and f2:

    f.f1:    nonzero weights of f1
    f.f2:    nonzero weights of f2
    f.i1:    indices for the 1st dimension
    f.i2:    indices for the 2nd dimension
    f.i3:    indices for the 3rd dimension
    f.size:  size of f1 and f2 (4 numbers for 4 dimensions)
    f.shift: sampling of the efferent neurons

(3) Operations

For robust object recognition, the model/system must be specific (to an object) and invariant (to its transformations). Reflecting this, the model has two canonical operations that are performed throughout the hierarchy. In the original implementation of 1999, they were a Gaussian function for specificity and a maximum operation for invariance. More recently, we proposed that a simple neural circuit based on a weighted sum and divisive normalization (possibly through shunting inhibition by a pool cell) can accomplish both operations with slight differences in the parameters of the same circuit. The operations are of the form

    y = sum(f1.*(x.^p)) / (c + sum(f2.*(x.^q)).^r);  % weighted sum and divisive normalization

Depending on the values of p, q, and r, y can be a specificity/tuning operation around some particular input x, or an invariance operation such as soft-max.

The computation-intensive parts of these operations are the weighted summations in the numerator and the denominator. Hence, in this implementation, they are carried out in mex-C files for efficiency/speed.
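As a minimal, plain-Matlab sketch of this operation and of the nonzero-only weighted sum that the mex-C files implement (the variable names, sizes, and parameter values below are hypothetical, for illustration only):

    % Hypothetical afferent patch and weights, for illustration only.
    x  = rand(3, 3, 4);                 % 3 x 3 grid of afferents, 4 types
    f1 = zeros(3, 3, 4);                % direct weights (mostly zero)
    f1(2, 2, 1) = 1.0;  f1(1, 3, 2) = 0.5;
    f2 = ones(3, 3, 4);                 % normalization-pool weights

    % Canonical operation: weighted sum with divisive normalization.
    % Illustrative parameter regimes (not necessarily the package defaults):
    %   tuning-like:   p = 1, q = 2, r = 0.5
    %   soft-max-like: p = q + 1, r = 1, with a large q
    p = 1;  q = 2;  r = 0.5;  c = 1e-4;
    y = sum(f1(:) .* (x(:).^p)) / (c + sum(f2(:) .* (x(:).^q))^r);

    % The sparse format stores only the nonzero weights and their subscripts
    % (cf. filt_package.m), so the sum touches only a few terms:
    w = f1(f1 ~= 0);                             % analogue of f.f1
    [i1, i2, i3] = ind2sub(size(f1), find(f1));  % analogues of f.i1, f.i2, f.i3
    num = 0;
    for k = 1:numel(w)
        num = num + w(k) * x(i1(k), i2(k), i3(k))^p;
    end
    % num now equals the numerator sum(f1(:) .* (x(:).^p)) above.

The mex-C routines described next perform this nonzero-only summation in C.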
Two C files (do_sp1.c and do_sp2.c) are provided, depending on how many weighted sums (scalar products) need to be computed:

    [y1, y2] = do_sp2(x.^p, x.^q, f1, f2, ...), where y1 = sum(f1.*(x.^p)) and y2 = sum(f2.*(x.^q))
    [y1] = do_sp1(x, f1, ...), where y1 = sum(f1.*x)

Taking advantage of the sparse nature of f1 and f2, these routines perform the weighted sums for the nonzero weights only.

3. OTHER POINTS

(1) Mixing/combining scales

Afferents with different receptive field sizes (or scales) may converge onto an efferent neuron. In particular, going from the S2 to the C2b layer, different scale bands may be max-pooled into one in order to implement scale invariance. In this implementation, different scales are combined by enlarging the smaller scale (with nearest-neighbor approximation) and applying the user-specified operation (usually soft-max or max); a sketch is given at the end of this section. As with the main computations, the same routines (do_sp1.c and do_sp2.c) are used for mixing scales. See comp_mix_scales.m and filt_scalemixing.m.

The scales in S1 are combined differently, since similar scales are to be combined. In this case, the scales are introduced along the third dimension (not as separate fields of the structure), and the C1 filters are used to combine them. However, this way of combining scales can be cumbersome when there are many different types of selectivities (i.e. the third dimension is large from the beginning).

(2) Learning

Learning in the model is based on the very simple procedure of "taking snapshots" of the training images (see Serre et al.). That is, the activation pattern due to a particular patch of a stimulus image is stored in the synaptic weights. Hence, the neuron will respond most strongly when the same image is encountered again, and less strongly to other images. Note that this is not the same as fitting a response profile to several different images. Although Cadieu et al. have successfully fit/reproduced V4 response profiles, such a fitting algorithm is not implemented here.
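A minimal sketch of the snapshot learning in point (2), with a hypothetical patch location and size (the actual procedure lives in learn_snapshots.m and learn_filters_load_save.m):

    % Hypothetical patch location and size, for illustration only.
    c1 = rand(32, 32, 4);                 % stand-in for r.r2, the C1-level
                                          % responses to a training image
    px = 10;  py = 10;  n = 4;            % upper-left corner and patch size
    patch = c1(px:px+n-1, py:py+n-1, :);  % activation pattern of the patch
    f1_new = patch;                       % stored directly as the direct weights
                                          % of a new S2-type unit, which will then
                                          % respond most strongly when the same
                                          % pattern is encountered again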
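And for the scale mixing in point (1), a minimal sketch of the nearest-neighbor enlargement (sizes hypothetical; cf. aux_resize_3d.m), using a plain max in place of the user-specified operation:

    % Hypothetical sizes, for illustration only.
    small = rand(8, 8, 4);         % responses at the coarser scale
    big   = rand(16, 16, 4);       % responses at the finer scale
    idx   = ceil((1:16) / 2);      % nearest-neighbor index map, 8 -> 16
    up    = small(idx, idx, :);    % enlarged coarse-scale map
    mixed = max(up, big);          % combined across the two scales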
4. DEMO

Start Matlab, and run "main" from the Matlab prompt. main.m contains an example script.

(1) Files

main_?: the main functions that initialize parameters and execute the other functions.
main_model: the main, stand-alone function of the model.
main_model_load_save: used to run on several images, loading previously computed responses and saving the outputs. The filters are computed once, so all the stimuli should be of the same size.

init_?: the functions/scripts for initializing different parameters.
init_filter_def: defines most parameters of the filters.
init_filt_oper: initializes the filter and operation types.
init_learn_param: initializes the learning parameters (training images, etc.).
init_operation_def: initializes the definitions of the operations.

comp_?: the functions for the major computations.
comp_crop_or_zeropad: decides whether to crop or zero-pad the input.
comp_get_next_layer: performs the main computation by calling the mex-C files.
comp_mix_scales: combines different scales.

do_sp?: perform the scalar products in mex-C (most of the computation).
do_sp1.c
do_sp2.c

filt_?: obtain the filters/synaptic connectivities.
filt_get_S/C_layer: gets the typical S/C layer filters.
filt_get_S1/C1: gets the filters for that layer.
filt_get_sized: computes the filters for a particular size.
filt_scalemixing: computes a filter for mixing scales.
filt_package: packages 4-D filters into the sparse format.

learn_?: the functions for learning in the model.
learn_filters_load_save: learns the activation patterns from images.
learn_snapshots: used by learn_filters.m to take snapshots.

aux_?: the auxiliary functions used by the others.
aux_get_filt_info: gets the scale information of a filter.
aux_merge_filters: merges different scales of a filter into one.
aux_replicate_filters: copies/replicates filters.
aux_resize_3d: resizes a filter during scale-mixing.
aux_sub2ind: converts subscripts into indices.
aux_quantize_resp: quantizes the response levels.

There are other functions of lesser importance, not listed above.

(2) Schematic execution flow

Schematically, the model is executed in the following way (see main_model.m):

    >> init_filt_oper;         % The necessary filters and operations are
                               % defined, by calling various filt_? functions.
    loop over the layers (S1, C1, S2, C2b, VTU)
      >> comp_crop_or_zeropad; % The previous layer is cropped or
                               % zero-padded appropriately.
      >> comp_get_next_layer;  % Given the previous layer, the next is
                               % computed (calls do_sp1.c or do_sp2.c).
    end-loop

If learning is involved, the model response up to the to-be-learned level is computed and saved to be used later.

5. HOW TO RUN DIFFERENT EXPERIMENTS (sample list)

In most cases, the modification/customization will be done at the init_? or filt_? level. For example, new operations (within the framework of the normalized scalar product) can be added in init_operation_def. More orientations or scales can be added in filt_get_S1.

(1) You may want to try different operations (max, Gauss, different exponents in the normalized scalar product). Then, (a) change the all_oper variable in init_filt_oper.m, and (b) change/add operation definitions in init_operation_def.m.

(2) You may want to train the S2 or VTU layer on different images. Then, change the training images in init_learn_param.m.

(3) The neural responses may be given noise, or discrete quanta/levels of response. You can put in such properties in comp_get_next_layer.m, as sketched below.
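For instance, a hypothetical sketch of such a modification (the parameter values are assumptions, not package defaults; cf. aux_quantize_resp.m for the existing quantization helper):

    y = rand(16, 16, 4);                 % stand-in for a computed layer response
    sigma   = 0.05;                      % assumed noise amplitude
    nLevels = 32;                        % assumed number of response levels
    y = y + sigma * randn(size(y));      % additive Gaussian noise
    y = max(y, 0);                       % keep responses non-negative
    y = round(y * nLevels) / nLevels;    % quantize to discrete levels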
6. ERROR CHECKING

There are not that many graceful error checks (i.e. typing in a wrong operation type will give some error from Matlab, not from me), so expect to do some debugging of your own if you happen to modify the default version. Of course, there may be some bugs from my coding (hopefully not too many); sorry about that, and please kindly report them to us.

7. ACKNOWLEDGEMENT AND CONTACT INFO

This particular implementation of the model of object recognition (formerly known as HMAX and/or the Standard Model) was created thanks to T. Poggio (the architect of the model and its theory), M. Riesenhuber (the original implementation), and T. Serre (especially the learning parts). Most of the coding and documentation for this package was done by M. Kouh. Any comments/bug reports are welcome (kouh@mit.edu). Thank you.