Seminars in 2018
A common approach to segmenting moving objects in a scene is background subtraction. Several methods have been proposed in this domain. However, they cannot handle various difficult scenarios such as illumination changes, background or camera motion, camouflage effects, and shadows. To address these issues, we propose a robust and flexible encoder-decoder neural network approach. We adapt a pretrained convolutional network, i.e. VGG-16 Net, under a triplet framework in the encoder part to embed an image at multiple scales into the feature space, and use a transposed convolutional network in the decoder part to learn a mapping from feature space to image space. We train this network end-to-end using only a few training samples. Our network takes an RGB image at three different scales and produces a foreground segmentation probability mask for the corresponding image. To evaluate our model, we entered the Change Detection 2014 Challenge (changedetection.net), where our method outperformed all existing state-of-the-art methods with an average F-Measure of 0.9770. Our source code will be made publicly available at https://github.com/lim-anggun/FgSegNet.
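The F-Measure quoted above is the harmonic mean of precision and recall over foreground pixels. The following is a minimal sketch of that metric, assuming binary masks stored as nested lists; the function names are illustrative and are not taken from the FgSegNet code.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives and false negatives
    between two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when nothing is detected."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


pred = [[1, 1, 0], [0, 1, 0]]    # hypothetical predicted foreground mask
truth = [[1, 0, 0], [0, 1, 1]]   # hypothetical ground-truth mask
score = f_measure(*confusion_counts(pred, truth))
```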
Attached files: fgSegNet_triplet.pdf
Abstract: The text data present in overlaid bands convey brief
descriptions of news events in broadcast videos. The process of
text extraction becomes challenging as overlay text is presented
in widely varying formats and often with animation effects. We
note that existing edge density based methods are well suited
for our application on account of their simplicity and speed of
operation. However, these methods are sensitive to thresholds
and have high false positive rates. In this paper, we present
a contrast enhancement based preprocessing stage for overlay
text detection and a parameter free edge density based scheme
for efficient text band detection. The second contribution of this
paper is a novel approach for multiple text region tracking with
a formal identification of all possible detection failure cases. The
tracking stage enables us to establish the temporal presence of
text bands and their linking over time. The third contribution
is the adoption of Tesseract OCR for the specific task of overlay
text recognition using web news articles. The proposed approach
is tested and found superior on news videos acquired from three
Indian English television news channels along with benchmark
Attached files: Overlay text ectraction.pdf
Multi-person pose estimation has improved greatly in recent years, especially with the development of convolutional neural networks. However, many challenging cases remain, such as occluded keypoints, invisible keypoints and complex backgrounds, which cannot be well addressed. In this paper, we present a novel network structure called Cascaded Pyramid Network (CPN), which aims to relieve the problem posed by these "hard" keypoints. More specifically, our algorithm includes two stages: GlobalNet and RefineNet. GlobalNet is a feature pyramid network which can successfully localize the "simple" keypoints like eyes and hands but may fail to precisely recognize the occluded or invisible keypoints. Our RefineNet explicitly handles the "hard" keypoints by integrating all levels of feature representations from the GlobalNet together with an online hard keypoint mining loss. In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by our CPN for keypoint localization in each human bounding box. Based on the proposed algorithm, we achieve state-of-the-art results on the COCO keypoint benchmark, with average precision at 73.0 on the COCO test-dev dataset and 72.1 on the COCO test-challenge dataset, which is a 19% relative improvement compared with 60.5 from the COCO 2016 keypoint challenge. Code and the detection results are publicly available for further research.
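The idea of an online hard keypoint mining loss can be sketched as follows. The abstract does not give the exact form of the loss, so this sketch assumes a per-keypoint mean squared error over heatmaps and keeps only the `top_k` largest losses; the function name and the parameter `top_k` are hypothetical.

```python
def ohkm_loss(pred, target, top_k):
    """Average the top_k largest per-keypoint losses.

    pred and target are lists of K heatmaps, each flattened to a list of
    floats. Per-keypoint loss is assumed to be mean squared error.
    """
    per_keypoint = []
    for p_map, t_map in zip(pred, target):
        mse = sum((p - t) ** 2 for p, t in zip(p_map, t_map)) / len(p_map)
        per_keypoint.append(mse)
    # Keep only the hardest keypoints, i.e. those with the largest loss.
    hardest = sorted(per_keypoint, reverse=True)[:top_k]
    return sum(hardest) / len(hardest)


pred = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]     # three toy keypoint heatmaps
target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss = ohkm_loss(pred, target, top_k=2)
```

With `top_k` equal to the number of keypoints, the sketch reduces to an ordinary mean over all keypoint losses.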
Attached files: Cascaded Pyramid Network for Multi-Person Pose Estimation.pdf
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
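One step of a top-down pathway with a lateral connection can be illustrated on single-channel feature maps. The abstract does not say how the maps are combined, so this sketch assumes 2x nearest-neighbor upsampling followed by element-wise addition, and it omits any convolution on the lateral branch; the function names are illustrative.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D map given as nested lists."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def merge_top_down(coarse, lateral):
    """Upsample the coarser map and add the same-size lateral map."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(u_row, l_row)]
            for u_row, l_row in zip(up, lateral)]


coarse = [[1, 2]]                                  # 1x2 high-level map
lateral = [[10, 10, 10, 10], [20, 20, 20, 20]]     # 2x4 lower-level map
merged = merge_top_down(coarse, lateral)
```

Applying this step repeatedly from the coarsest level downward yields one merged map per scale.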
Attached files: Lin_Feature_Pyramid_Networks_CVPR_2017_paper.pdf
This study proposes an automatic reading approach for a pointer gauge based on computer vision. Moreover, the study aims to highlight the defects of the current automatic-recognition method of the pointer gauge and introduces a method that uses a coarse-to-fine scheme and has superior performance in the accuracy and stability of its reading identification. First, it uses the region growing method to locate the dial region and its center. Second, it uses an improved central projection method to determine the circular scale region under the polar coordinate system and detect the scale marks. Then, the border detection is implemented in the dial image, and the Hough transform method is used to obtain the pointer direction by means of pointer contour fitting. Finally, the reading of the gauge is obtained by comparing the location of the pointer with the scale marks. The experimental results demonstrate the effectiveness of the proposed approach. This approach is applicable for reading gauges whose scale marks are either evenly or unevenly distributed.
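The final step, comparing the pointer location with the scale marks, can be sketched as an interpolation between the two marks that bracket the pointer. This is only an illustration under assumptions: the marks are given as hypothetical (angle, value) pairs sorted by angle, and the reading is assumed to vary linearly between neighboring marks, which also covers unevenly spaced scales.

```python
from bisect import bisect_right


def read_gauge(pointer_angle, marks):
    """Interpolate the gauge value at pointer_angle.

    marks is a list of (angle_deg, value) pairs sorted by angle.
    """
    angles = [a for a, _ in marks]
    if not angles[0] <= pointer_angle <= angles[-1]:
        raise ValueError("pointer outside the scale")
    # Index of the first mark strictly beyond the pointer, clamped so that
    # marks[i - 1] and marks[i] always form a valid bracketing pair.
    i = min(max(bisect_right(angles, pointer_angle), 1), len(marks) - 1)
    (a0, v0), (a1, v1) = marks[i - 1], marks[i]
    return v0 + (v1 - v0) * (pointer_angle - a0) / (a1 - a0)


# Hypothetical, unevenly distributed scale marks.
marks = [(0, 0.0), (30, 1.0), (90, 2.0), (180, 4.0)]
reading = read_gauge(60, marks)
```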
Attached files: Machine Vision Based Automatic Detection Method of Indicating Values of a Pointer Gauge.pdf
We introduce Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF), a super vector-based encoding method specifically designed for local deep features encoding. The proposed method addresses an important problem of video understanding: how to build a video representation that incorporates the CNN features over the entire video. Feature assignment is carried out at two levels, by using the similarity and spatio-temporal information. For each assignment we build a specific encoding, focused on the nature of deep features, with the goal to capture the highest feature responses from the highest neuron activation of the network. Our ST-VLMPF clearly provides a more reliable video representation than some of the most widely used and powerful encoding approaches (Improved Fisher Vectors and Vector of Locally Aggregated Descriptors), while maintaining a low computational complexity. We conduct experiments on three action recognition datasets: HMDB51, UCF50 and UCF101. Our pipeline obtains state-of-the-art results.
Accurately tracking multiple vehicles plays an important role in intelligent transportation, especially in intelligent vehicles. Because traffic environments are complicated, it is difficult to track multiple vehicles accurately and robustly, especially when vehicles occlude one another. To alleviate these problems, a new approach is proposed that tracks multiple vehicles by combining robust detection with two classifiers. An improved ViBe algorithm is proposed for robust and accurate detection of multiple vehicles. It uses gray-scale spatial information to build a dictionary of pixel life length so that ghost shadows and an object's residual shadows are quickly blended into the background samples. The improved algorithm also applies post-processing to suppress dynamic noise. In this paper, we also design a method using two classifiers to further address the failure to track vehicles under occlusion and interference. It classifies tracking rectangles whose confidence values lie between two thresholds by combining local binary patterns with a support vector machine (SVM) classifier, and then applies a convolutional neural network (CNN) classifier a second time to remove interference areas between vehicles and other moving objects. The two-classifier method has both the time efficiency of the SVM and the high accuracy of the CNN. Compared with several existing methods, qualitative and quantitative analysis of our experimental results shows that the proposed method not only effectively removes ghost shadows and improves detection accuracy and real-time performance, but is also robust to occlusion among multiple vehicles in various traffic scenes.
Attached files: A New Approach to Track Multiple Vehicles With.pdf 20180526-report-Yang Yu.pptx
This paper presents a real-time face detector, named Single Shot Scale-invariant Face Detector (S3FD), which performs well on various scales of faces with a single deep neural network, especially for small faces. Specifically, we try to solve the common problem that anchor-based detectors deteriorate dramatically as the objects become smaller. We make contributions in the following three aspects: 1) proposing a scale-equitable face detection framework to handle different scales of faces well. We tile anchors on a wide range of layers to ensure that all scales of faces have enough features for detection. Besides, we design anchor scales based on the effective receptive field and a proposed equal proportion interval principle; 2) improving the recall rate of small faces by a scale compensation anchor matching strategy; 3) reducing the false positive rate of small faces via a max-out background label. As a consequence, our method achieves state-of-the-art detection performance on all the common face detection benchmarks, including the AFW, PASCAL face, FDDB and WIDER FACE datasets, and can run at 36 FPS on an Nvidia Titan X (Pascal) for VGA-resolution images.
Attached files: 1708.05237.pdf
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron.
Attached files: MaskRCNN.pdf
We present an improved three-step pipeline for the stereo matching problem and introduce multiple novelties at each stage. We propose a new highway network architecture for computing the matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained with a hybrid loss that supports multilevel comparison of image patches. A novel post-processing step is then introduced, which employs a second deep convolutional neural network for pooling global information from multiple disparities. This network outputs both the image disparity map, which replaces the conventional "winner takes all" strategy, and a confidence in the prediction. The confidence score is achieved by training the network with a new technique that we call the reflective loss. Lastly, the learned confidence is employed in order to better detect outliers in the refinement step. The proposed pipeline achieves state-of-the-art accuracy on the largest and most competitive stereo benchmarks, and the learned confidence is shown to outperform all existing alternatives.
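For context, the conventional "winner takes all" strategy that the second network replaces simply picks, for every pixel, the disparity with the lowest matching cost. A minimal sketch of that baseline (not of the proposed network), assuming a cost volume stored as nested lists:

```python
def winner_takes_all(cost_volume):
    """Return the per-pixel disparity index with the lowest matching cost.

    cost_volume[y][x] is a list of costs, one per candidate disparity.
    """
    return [[min(range(len(costs)), key=costs.__getitem__) for costs in row]
            for row in cost_volume]


# One image row with two pixels and three candidate disparities each.
cost_volume = [[[0.9, 0.2, 0.5], [0.1, 0.4, 0.3]]]
disparity = winner_takes_all(cost_volume)
```

Because each pixel is decided independently, this baseline uses no global information, which is the gap the learned pooling step is described as filling.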
We introduce the dense captioning task, which requires a
computer vision system to both localize and describe salient
regions in images in natural language. The dense captioning
task generalizes object detection when the descriptions
consist of a single word, and Image Captioning when one
predicted region covers the full image. To address the localization
and description task jointly we propose a Fully Convolutional
Localization Network (FCLN) architecture that
processes an image with a single, efficient forward pass, requires
no external region proposals, and can be trained
end-to-end with a single round of optimization. The architecture
is composed of a Convolutional Network, a novel
dense localization layer, and Recurrent Neural Network
language model that generates the label sequences. We
evaluate our network on the Visual Genome dataset, which
comprises 94,000 images and 4,100,000 region-grounded
captions. We observe both speed and accuracy improvements
over baselines based on current state of the art approaches
in both generation and retrieval settings.
Attached files: DenseCap-Fully Convolutional Localization Networks for Dense Captioning.pdf
Articulated human pose estimation is a fundamental yet challenging task in computer vision. The difficulty is particularly pronounced in scale variations of human body parts when the camera view changes or severe foreshortening happens. Although pyramid methods are widely used to handle scale changes at inference time, learning feature pyramids in deep convolutional neural networks (DCNNs) is still not well explored. In this work, we design a Pyramid Residual Module (PRM) to enhance the invariance in scales of DCNNs. Given input features, the PRMs learn convolutional filters on various scales of input features, which are obtained with different subsampling ratios in a multi-branch network. Moreover, we observe that it is inappropriate to adopt existing methods to initialize the weights of multi-branch networks, which have recently achieved better performance than plain networks in many tasks. Therefore, we provide a theoretical derivation to extend the current weight initialization scheme to multi-branch network structures. We investigate our method on two standard benchmarks for human pose estimation. Our approach obtains state-of-the-art results on both benchmarks. Code is available at https://github.com/bearpaw/PyraNet
Attached files: Learning Feature Pyramids for Human Pose Estimation.pdf
We introduce the notion of semantic background subtraction, a novel framework for motion detection in video sequences. The key innovation is to leverage object-level semantics to address the variety of challenging scenarios for background subtraction. Our framework combines the information of a semantic segmentation algorithm, expressed by a probability for each pixel, with the output of any background subtraction algorithm to reduce false positive detections produced by illumination changes, dynamic backgrounds, strong shadows, and ghosts. In addition, it maintains a fully semantic background model to improve the detection of camouflaged foreground objects. Experiments conducted on the CDNet dataset show that we managed to improve, significantly, almost all background subtraction algorithms of the CDNet leaderboard, and reduce the mean overall error rate of all the 34 algorithms (resp. of the best 5 algorithms) by roughly 50% (resp. 20%). Note that a C++ implementation of the framework is available at http://www.telecom.ulg.ac.be/semantic.
Attached files: Braham2017Semantic.pdf
Accurate and fast detection of moving targets from a moving camera is an important yet challenging problem, especially when the computational resources are limited. In this paper, we propose an effective, efficient, and robust method to accurately detect and segment multiple independently moving foreground targets from a video sequence taken by a monocular moving camera [e.g., onboard an unmanned aerial vehicle (UAV)]. Our proposed method advances the existing methods in a number of ways, where: 1) camera motion is estimated through tracking background keypoints using pyramidal Lucas-Kanade at every detection interval, for efficiency; 2) foreground segmentation is applied by integrating a local motion history function with spatio-temporal differencing over a sliding window for detecting multiple moving targets, while the perspective homography is used at image registration for effectiveness; and 3) the detection interval is adjusted dynamically based on a rule-of-thumb technique and considering camera setup parameters for robustness. The proposed method has been tested on a variety of scenarios using a UAV camera, as well as publicly available data sets. Based on the reported results and through comparison with the existing methods, the accuracy of the proposed method in detecting multiple moving targets as well as its capability for real-time implementation has been successfully demonstrated. Our method is also robustly applicable to ground-level cameras for ITS applications, as confirmed by the experimental results. More specifically, the proposed method shows promising performance compared with the literature in terms of quantitative metrics, while the run-time measures are significantly improved for real-time implementation.
Attached files: Effective and Efficient Detection of Moving.pdf 20180310-report-Yang Yu.pptx
We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile the model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss. Thanks to parameter sharing between child models, ENAS is fast: it delivers strong empirical performances using much fewer GPU-hours than all existing automatic model design approaches, and notably, 1000x less expensive than standard Neural Architecture Search. On the Penn Treebank dataset, ENAS discovers a novel architecture that achieves a test perplexity of 55.8, establishing a new state-of-the-art among all methods without post-training processing. On the CIFAR-10 dataset, ENAS designs novel architectures that achieve a test error of 2.89%, which is on par with NASNet (Zoph et al., 2018), whose test error is 2.65%.
Attached files: enas.pdf
Abstract. This paper presents an automatic segmentation system for
characters in text color images cropped from natural images or videos
based on a new neuronal architecture ensuring fast processing and robustness
against noise, variations in illumination, complex background
and low resolution. An off-line training phase on a set of synthetic text
color images, where the exact character positions are known, allows adjusting
the neural parameters and thus building an optimal nonlinear
filter which extracts the best features in order to robustly detect the border
positions between characters. The proposed method is tested on a
set of synthetic text images to precisely evaluate its performance according
to noise, and on a set of complex text images collected from video
frames and web pages to evaluate its performance on real images. The
results are encouraging with a good segmentation rate of 89.12% and a
recognition rate of 81.94% on a set of difficult text images collected from
video frames and from web pages.
Attached files: An_Automatic_Method_for_Video_Character_Segmentati.pdf
The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
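A small sketch can show why dilated temporal convolutions capture long-range patterns with few layers. The kernel size and dilation schedule below are hypothetical, since the abstract does not state them, and the functions are illustrative rather than taken from the authors' code.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1D correlation whose taps are spaced `dilation` steps apart."""
    span = (len(kernel) - 1) * dilation
    return [sum(k * signal[t + i * dilation] for i, k in enumerate(kernel))
            for t in range(len(signal) - span)]


def receptive_field(kernel_size, dilations):
    """Number of input time steps seen by one output of stacked dilated layers."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


out = dilated_conv1d([1, 2, 3, 4, 5], kernel=[1, 1], dilation=2)
rf_dilated = receptive_field(3, [1, 2, 4, 8])   # doubling dilations
rf_plain = receptive_field(3, [1, 1, 1, 1])     # ordinary convolutions
```

With the same four layers, the doubling schedule covers 31 time steps while undilated layers cover only 9.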
Attached files: Feb saturday seminar final.pptx cand1_Lea_Temporal_Convolutional_Networks_CVPR_2017_paper.pdf
We present an accurate stereo matching method using local expansion moves based on graph cuts. This new move-making scheme is used to efficiently infer per-pixel 3D plane labels on a pairwise Markov random field (MRF) that effectively combines recently proposed slanted patch matching and curvature regularization terms. The local expansion moves are presented as many α-expansions defined for small grid regions. The local expansion moves extend traditional expansion moves in two ways: localization and spatial propagation. By localization, we use different candidate α-labels according to the locations of local α-expansions. By spatial propagation, we design our local α-expansions to propagate currently assigned labels for nearby regions. With this localization and spatial propagation, our method can efficiently infer MRF models with a continuous label space using randomized search. Our method has several advantages over previous approaches that are based on fusion moves or belief propagation: it produces submodular moves deriving a subproblem optimality; it helps find good, smooth, piecewise linear disparity maps; it is suitable for parallelization; it can use cost-volume filtering techniques for accelerating the matching cost computations. Even using a simple pairwise MRF, our method is shown to have the best performance in the Middlebury stereo benchmark V2 and V3.
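A per-pixel 3D plane label can be read as three coefficients that give the disparity at any pixel as a linear function of its coordinates, which is why the label space is continuous and the resulting disparity maps are piecewise linear. A minimal sketch, assuming the parameterization d = a*x + b*y + c (the abstract does not spell it out, and the function name is illustrative):

```python
def plane_disparity(plane, x, y):
    """Disparity at pixel (x, y) induced by a plane label (a, b, c)."""
    a, b, c = plane
    return a * x + b * y + c


slanted = (0.5, 0.0, 10.0)          # disparity grows with x: a slanted surface
fronto_parallel = (0.0, 0.0, 7.0)   # constant disparity everywhere
d_slanted = plane_disparity(slanted, 4, 7)
d_flat = plane_disparity(fronto_parallel, 4, 7)
```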
Attached files: TPAMI-Contiuous 3D Label Stereo Matching using Local Expansion.pdf
Attached files: continuous 3D Label Stereo Matching using Local Expansion moves.pdf
Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN-based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexities.
Attached files: Lattice Long Short-Term Memory for Human Action Recognition.pdf
Stereo matching is a challenging problem with respect to weak texture, discontinuities, illumination difference and occlusions. Therefore, a deep learning framework is presented in this paper, which focuses on the first and last stage of typical stereo methods: the matching cost computation and the disparity refinement. For matching cost computation, two patch-based network architectures are exploited to allow the trade-off between speed and accuracy, both of which leverage multi-size and multi-layer pooling units with no strides to learn cross-scale feature representations. For disparity refinement, unlike traditional handcrafted refinement algorithms, we incorporate the initial optimal and sub-optimal disparity maps before outlier detection. Furthermore, diverse base learners are encouraged to focus on specific replacement tasks, corresponding to the smooth regions and details. Experiments on different datasets demonstrate the effectiveness of our approach, which is able to obtain sub-pixel accuracy and restore occlusions to a great extent. Specifically, our accurate framework attains near-peak accuracy in both non-occluded and occluded regions, and our fast framework achieves competitive performance against the fast algorithms on the Middlebury benchmark.
A capsule is a group of neurons whose outputs represent different properties of the same entity. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 pose matrix which could learn to represent the relationship between that entity and the viewer. A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated using the EM algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The whole system is trained discriminatively by unrolling 3 iterations of EM between each pair of adjacent layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules also show far more resistance to white box adversarial attack than our baseline convolutional neural network.
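The voting step described above, a lower-level capsule multiplying its own 4x4 pose matrix by one transformation matrix per higher-level capsule, can be sketched in a few lines. The EM routing that weights and clusters the votes is not shown, and the helper names are illustrative.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def capsule_votes(pose, transforms):
    """One vote per higher-level capsule: pose matrix times its transform."""
    return [matmul(pose, w) for w in transforms]


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
double = [[2 if i == j else 0 for j in range(4)] for i in range(4)]
pose = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
votes = capsule_votes(pose, [identity, double])
```

An identity transform returns the pose unchanged, which makes the example easy to check by hand.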
Attached files: MATRIX CAPSULES WITH EM ROUTING.pdf