Seminars in 2018
We present an improved three-step pipeline for the stereo matching problem and introduce multiple novelties at each stage. We propose a new highway network architecture for computing the matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained
with a hybrid loss that supports multilevel comparison of image patches. A novel post-processing step is then introduced, which employs a second deep convolutional neural network for pooling global information from multiple disparities. This network outputs both the image disparity
map, which replaces the conventional "winner takes all" strategy, and a confidence in the prediction. The confidence score is obtained by training the network with a new technique
that we call the reflective loss. Lastly, the learned confidence is employed in order to better detect outliers in the refinement step. The proposed pipeline achieves state of the art accuracy on the largest and most competitive stereo benchmarks, and the learned confidence is shown to outperform all existing alternatives.
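For context, the conventional "winner takes all" strategy that the disparity network replaces can be sketched as below. This is only an illustration of the baseline, not of the presented pipeline; the cost volume layout `cost_volume[d][y][x]` and the function name `winner_takes_all` are assumptions made for this example.

```python
def winner_takes_all(cost_volume):
    """Pick, for every pixel, the disparity with the lowest matching cost.

    Assumed layout: cost_volume[d][y][x] holds the cost of disparity d at pixel (x, y).
    """
    num_disp = len(cost_volume)
    height = len(cost_volume[0])
    width = len(cost_volume[0][0])
    disparity = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Gather the costs of all candidate disparities for this pixel.
            costs = [cost_volume[d][y][x] for d in range(num_disp)]
            # Index of the smallest cost is the selected disparity.
            disparity[y][x] = min(range(num_disp), key=costs.__getitem__)
    return disparity
```

Because each pixel is decided independently from its own cost column, this baseline uses no global information, which is the limitation the second network is meant to address.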
We introduce the dense captioning task, which requires a
computer vision system to both localize and describe salient
regions in images in natural language. The dense captioning
task generalizes object detection when the descriptions
consist of a single word, and Image Captioning when one
predicted region covers the full image. To address the localization
and description task jointly we propose a Fully Convolutional
Localization Network (FCLN) architecture that
processes an image with a single, efficient forward pass, requires
no external region proposals, and can be trained
end-to-end with a single round of optimization. The architecture
is composed of a Convolutional Network, a novel
dense localization layer, and a Recurrent Neural Network
language model that generates the label sequences. We
evaluate our network on the Visual Genome dataset, which
comprises 94,000 images and 4,100,000 region-grounded
captions. We observe both speed and accuracy improvements
over baselines based on current state of the art approaches
in both generation and retrieval settings.
Attached files: DenseCap-Fully Convolutional Localization Networks for Dense Captioning.pdf
Articulated human pose estimation is a fundamental yet challenging task in computer vision. The difficulty is particularly pronounced for scale variations of human body parts when the camera view changes or severe foreshortening occurs. Although pyramid methods are widely used to handle scale changes at inference time, learning feature pyramids in deep convolutional neural networks (DCNNs) is still not well explored. In this work, we design Pyramid Residual Modules (PRMs) to enhance the scale invariance of DCNNs. Given input features, the PRMs learn convolutional filters on various scales of input features, which are obtained with different subsampling ratios in a multi-branch network. Moreover, we observe that it is inappropriate to adopt existing methods to initialize the weights of multi-branch networks, which have recently outperformed plain networks in many tasks. Therefore, we provide a theoretical derivation to extend the current
weight initialization scheme to multi-branch network structures. We investigate our method on two standard benchmarks for human pose estimation. Our approach obtains state-of-the-art results on both benchmarks. Code is available at https://github.com/bearpaw/PyraNet
Attached files: Learning Feature Pyramids for Human Pose Estimation.pdf
We introduce the notion of semantic background subtraction, a novel framework for motion detection in video sequences. The key innovation is to leverage object-level semantics to address the variety of challenging scenarios for background subtraction. Our framework combines the information of a semantic segmentation algorithm, expressed by a probability for each pixel, with the output of any background subtraction algorithm to reduce false positive detections produced by illumination changes, dynamic backgrounds, strong shadows, and ghosts. In addition, it maintains a fully semantic background model to improve the detection of camouflaged foreground objects. Experiments conducted on the CDNet dataset show that we significantly improve almost all background subtraction algorithms of the CDNet leaderboard, and reduce the mean overall error rate of all the 34 algorithms (resp. of the best 5 algorithms) by roughly 50% (resp. 20%). Note that a C++ implementation of the framework is available at http://www.telecom.ulg.ac.be/semantic.
Attached files: Braham2017Semantic.pdf
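The combination of a per-pixel semantic probability with the output of a background subtraction algorithm might look like the following per-pixel sketch. The two decision rules and the thresholds `tau_bg` and `tau_fg` are assumptions chosen for illustration; the abstract does not specify them, and the function names are hypothetical.

```python
def combine_pixel(bgs_foreground, semantic_prob, semantic_bg_model,
                  tau_bg=0.1, tau_fg=0.5):
    """Return True if the pixel is classified as foreground.

    bgs_foreground:    decision of any background subtraction algorithm
    semantic_prob:     probability that the pixel belongs to an object of interest
    semantic_bg_model: semantic probability stored in the background model
    tau_bg, tau_fg:    hypothetical thresholds (not taken from the abstract)
    """
    # Assumed rule 1: very low semantic probability means background,
    # which suppresses false positives (shadows, illumination changes).
    if semantic_prob <= tau_bg:
        return False
    # Assumed rule 2: a large rise over the semantic background model means
    # foreground, which helps with camouflaged objects.
    if semantic_prob - semantic_bg_model >= tau_fg:
        return True
    # Otherwise defer to the background subtraction algorithm.
    return bgs_foreground


def combine_mask(bgs_mask, semantic_probs, semantic_bg):
    """Apply combine_pixel to every pixel of equally sized 2D lists."""
    return [
        [combine_pixel(b, p, m) for b, p, m in zip(b_row, p_row, m_row)]
        for b_row, p_row, m_row in zip(bgs_mask, semantic_probs, semantic_bg)
    ]
```

The fallback branch is what lets such a scheme sit on top of any existing algorithm: pixels without strong semantic evidence keep their original classification.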
Accurate and fast detection of moving targets from a moving camera is an important yet challenging problem, especially when computational resources are limited. In this paper, we propose an effective, efficient, and robust method to accurately detect and segment multiple independently moving foreground targets from a video sequence taken by a monocular moving camera [e.g., onboard an unmanned aerial vehicle (UAV)]. Our proposed method advances the existing methods in a number of ways: 1) camera motion is estimated by tracking background keypoints using pyramidal Lucas-Kanade at every detection interval, for efficiency; 2) foreground segmentation is applied by integrating a local motion history function with spatio-temporal differencing over a sliding window for detecting multiple moving targets, while the perspective homography is used at image registration for effectiveness; and 3) the detection interval is adjusted dynamically based on a rule-of-thumb technique and considering camera setup parameters for robustness. The proposed method has been tested on a variety of scenarios using a UAV camera, as well as publicly available data sets. Based on the reported results and through comparison with the existing methods, the accuracy of the proposed method in detecting multiple moving targets as well as its capability for real-time implementation has been successfully demonstrated. Our method is also robustly applicable to ground-level cameras for ITS applications, as confirmed by the experimental results. More specifically, the proposed method shows promising performance compared with the literature in terms of quantitative metrics, while the run-time measures are significantly improved for real-time implementation.
Attached files: Effective and Efficient Detection of Moving.pdf 20180310-report-Yang Yu.pptx
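The image registration step relies on a perspective homography that maps pixel coordinates of one frame into another. A minimal sketch of applying such a 3x3 matrix to a single point is given below; the function name `apply_homography` and the nested-list matrix layout are assumptions for this example, and estimating the matrix itself (e.g. from tracked keypoints) is not shown.

```python
def apply_homography(h, x, y):
    """Map the point (x, y) through a 3x3 homography h (row-major nested lists)."""
    xp = h[0][0] * x + h[0][1] * y + h[0][2]
    yp = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return xp / w, yp / w
```

Once both frames are expressed in the same coordinates, the remaining differences between them can be attributed to independently moving targets rather than to camera motion.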
We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile, the model corresponding to the selected subgraph is trained to minimize a canonical cross-entropy loss. Thanks to parameter sharing between child models, ENAS is fast: it delivers strong empirical performance using far fewer GPU-hours than all existing automatic model design approaches, and is notably 1000x less expensive than standard Neural Architecture Search. On the Penn Treebank dataset, ENAS discovers a novel architecture that achieves a test perplexity of 55.8, establishing a new state of the art among all methods without post-training processing. On the CIFAR-10 dataset, ENAS designs novel architectures that achieve a test error of 2.89%, which is on par with NASNet (Zoph et al., 2018), whose test error is 2.65%.
Attached files: enas.pdf
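The policy-gradient training of the controller can be illustrated with a single REINFORCE update for one categorical decision, such as choosing one operation among several. This is a simplified sketch under stated assumptions: the controller is reduced to a vector of logits, and `baseline` and `lr` are hypothetical parameters; the actual controller architecture and reward are not described here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, baseline=0.0, lr=0.1):
    """One gradient-ascent step on (reward - baseline) * log pi(action).

    For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - pi_k.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        logit + lr * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (logit, p) in enumerate(zip(logits, probs))
    ]
```

With a positive advantage the sampled choice becomes more likely on the next draw, and with a negative one it becomes less likely, which is how the validation reward steers the search.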
This paper presents an automatic segmentation system for
characters in color text images cropped from natural images or videos
based on a new neural architecture ensuring fast processing and robustness
against noise, variations in illumination, complex background
and low resolution. An off-line training phase on a set of synthetic color
text images, where the exact character positions are known, allows adjusting
the neural parameters and thus building an optimal nonlinear
filter which extracts the best features in order to robustly detect the border
positions between characters. The proposed method is tested on a
set of synthetic text images to precisely evaluate its performance according
to noise, and on a set of complex text images collected from video
frames and web pages to evaluate its performance on real images. The
results are encouraging with a good segmentation rate of 89.12% and a
recognition rate of 81.94% on a set of difficult text images collected from
video frames and from web pages.
Attached files: An_Automatic_Method_for_Video_Character_Segmentati.pdf
The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
Attached files: Feb saturday seminar final.pptx cand1_Lea_Temporal_Convolutional_Networks_CVPR_2017_paper.pdf
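The dilated convolutions used by the Dilated TCN can be sketched for a single 1D channel as follows. This is an illustrative, zero-padded, same-length variant written for clarity; the function names, the centered kernel alignment, and the padding choice are assumptions, not details taken from the paper.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1D cross-correlation with zero padding.

    Kernel tap j reads signal[t + (j - center) * dilation], so a larger
    dilation spreads the same number of taps over a longer time span.
    """
    center = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - center) * dilation
            if 0 <= idx < len(signal):  # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of time steps seen by one output after stacking dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

For example, three stacked layers with kernel size 3 and dilations 1, 2, 4 see 15 time steps, while the same stack without dilation would see only 7; this growth is what allows long-range temporal patterns to be captured with few layers.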
We present an accurate stereo matching method using local expansion moves based on graph cuts. This new move-making scheme is used to efficiently infer per-pixel 3D plane labels on a pairwise Markov random field (MRF) that effectively combines recently proposed slanted patch matching and curvature regularization terms. The local expansion moves are presented as many α-expansions defined for small grid regions. The local expansion moves extend traditional expansion moves in two ways: localization and spatial propagation. By localization, we use different candidate α-labels according to the locations of local α-expansions. By spatial propagation, we design our local α-expansions to propagate currently assigned labels for nearby regions. With this localization and spatial propagation, our method can efficiently infer MRF models with a continuous label space using randomized search. Our method has several advantages over previous approaches that are based on fusion moves or belief propagation: it produces submodular moves deriving a subproblem optimality; it helps find good, smooth, piecewise linear disparity maps; it is suitable for parallelization; and it can use cost-volume filtering techniques for accelerating the matching cost computations. Even using a simple pairwise MRF, our method is shown to have the best performance in the Middlebury stereo benchmark V2 and V3.
Attached files: TPAMI-Contiuous 3D Label Stereo Matching using Local Expansion.pdf
Attached files: continuous 3D Label Stereo Matching using Local Expansion moves.pdf
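A per-pixel 3D plane label can be read as a small linear model of disparity over image coordinates. The sketch below assumes the common parameterization d = a*x + b*y + c; this parameterization and the function names are assumptions for illustration, and the optimization by local expansion moves is not shown.

```python
def plane_disparity(plane, x, y):
    """Disparity at pixel (x, y) induced by a plane label (a, b, c)."""
    a, b, c = plane
    return a * x + b * y + c


def disparity_map(labels):
    """Turn a 2D grid of plane labels (labels[y][x]) into a disparity map."""
    return [
        [plane_disparity(labels[y][x], x, y) for x in range(len(labels[y]))]
        for y in range(len(labels))
    ]
```

A label with a = b = 0 is a fronto-parallel plane of constant disparity, while nonzero a or b yields a slanted surface, which is why the label space is continuous and the resulting maps are piecewise linear.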
Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN-based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but
invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing
the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark
datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of
similar model complexities.
Attached files: Lattice Long Short-Term Memory for Human Action Recognition.pdf
Stereo matching is a challenging problem with respect to weak texture, discontinuities, illumination differences and occlusions. Therefore, a deep learning framework is presented in this paper, which focuses on the first and last stages of typical stereo methods: the matching cost computation and the
disparity refinement. For matching cost computation, two patch-based network architectures are exploited to allow a trade-off between speed and accuracy, both of which leverage multi-size and multi-layer pooling units with no strides to learn cross-scale feature representations. For disparity refinement, unlike traditional handcrafted refinement algorithms, we incorporate the initial optimal and sub-optimal disparity maps before outlier detection. Furthermore, diverse base learners are encouraged to focus on specific replacement tasks, corresponding to the smooth regions and details. Experiments on different datasets demonstrate the effectiveness of our approach, which is able to obtain sub-pixel accuracy and restore occlusions to a great extent. Specifically, our accurate framework attains near-peak accuracy in both non-occluded and occluded regions, and our fast framework achieves competitive performance against the fast algorithms on the Middlebury benchmark.
A capsule is a group of neurons whose outputs represent different properties of the same entity. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 pose matrix which could learn to represent the relationship between that entity and the viewer. A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated using the EM algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The whole system is trained discriminatively by unrolling 3 iterations of EM between each pair of adjacent layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state of the art. Capsules also show far more resistance to white-box adversarial attack than our baseline convolutional neural network.
Attached files: MATRIX CAPSULES WITH EM ROUTING.pdf
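The voting step described above, in which a capsule multiplies its 4x4 pose matrix by a transformation matrix and the votes are weighted by assignment coefficients, can be sketched as follows. The function names are hypothetical, and the weighted mean is only a simplified stand-in for one part of the EM update, not the full routing procedure.

```python
def matmul4(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def capsule_vote(pose, transform):
    """Vote of a lower-level capsule for the pose of a higher-level capsule."""
    return matmul4(pose, transform)


def weighted_mean_vote(votes, coeffs):
    """Element-wise mean of 4x4 votes, weighted by assignment coefficients."""
    total = sum(coeffs)
    return [[sum(c * v[i][j] for v, c in zip(votes, coeffs)) / total
             for j in range(4)] for i in range(4)]
```

Votes that agree produce a tight cluster around this mean, which is the signal the routing uses to decide where each capsule's output should go.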